search suggestion - Solr (Open Solr) suggester results contain punctuation marks
I'm working on a suggester, and the results I'm getting contain punctuation. For example, when I type "volcan" I get:
"volcanoes", "volcanic", "volcano", "volcano," (with a trailing comma), "volcanoes." (with a trailing period/full stop)
Here is the code in my solrconfig.xml file:
<searchComponent class="solr.SpellCheckComponent" name="suggest">
  <lst name="spellchecker">
    <str name="name">suggest</str>
    <str name="classname">org.apache.solr.spelling.suggest.Suggester</str>
    <str name="lookupImpl">org.apache.solr.spelling.suggest.tst.TSTLookup</str>
    <str name="field">text</str>
    <float name="threshold">0.005</float>
    <str name="buildOnCommit">true</str>
  </lst>
</searchComponent>
<requestHandler class="org.apache.solr.handler.component.SearchHandler" name="/suggest">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <str name="spellcheck">true</str>
    <str name="spellcheck.dictionary">suggest</str>
    <str name="spellcheck.onlyMorePopular">true</str>
    <str name="spellcheck.count">5</str>
    <str name="spellcheck.collate">true</str>
  </lst>
  <lst name="invariants">
    <!-- always run the suggester for queries to this handler -->
    <str name="spellcheck">true</str>
    <!-- collation is not needed: the query is tokenized as a keyword, so we only need suggestions for that term -->
    <str name="spellcheck.collate">false</str>
  </lst>
  <arr name="components">
    <str>suggest</str>
  </arr>
</requestHandler>
In my schema.xml file I have this:
<fieldType name="spell" class="solr.TextField" positionIncrementGap="100" indexed="true" stored="false" multiValued="true" termVectors="true" termPositions="true" termOffsets="true">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    <filter class="solr.ShingleFilterFactory" minShingleSize="2" maxShingleSize="4" outputUnigrams="true" outputUnigramsIfNoShingles="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.TrimFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
And the result is:
{
  "responseHeader": {
    "status": 0,
    "QTime": 0,
    "params": { "wt": "json", "q": "volcan" }
  },
  "spellcheck": {
    "suggestions": [
      "volcan",
      {
        "numFound": 5,
        "startOffset": 0,
        "endOffset": 6,
        "suggestion": ["volcanoes", "volcanic", "volcano", "volcano,", "volcanoes."]
      }
    ]
  }
}
The problem is not in the requestHandler. Rather, it seems to reside in the way you're indexing the files that go into the spell field, and maybe in the spell field itself. I'm thinking you should enable a tokenizer that strips the punctuation out of the fields.
Here's the spell field definition that works for me in schema.xml:
<fieldType name="spell" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
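One thing that may be worth checking: the suggester in the question builds its dictionary from the indexed terms of the field named in `<str name="field">`, which is `text`, not a field that uses the `spell` type. If `text` is analyzed with a tokenizer that keeps punctuation attached to words, terms like "volcano," would end up in the suggestions no matter how the `spell` type is defined. The following is only a rough sketch of how the two could be wired together. The field name `spell` and the use of `text` as the copy source are assumptions for illustration, not something confirmed by the question.

```xml
<!-- schema.xml: a field that uses the "spell" fieldType.
     The field name "spell" is an assumption made for this sketch. -->
<field name="spell" type="spell" indexed="true" stored="false" multiValued="true"/>

<!-- Copy the raw content of the existing "text" field into it at index time,
     so it is analyzed with the "spell" index analyzer. -->
<copyField source="text" dest="spell"/>

<!-- solrconfig.xml: inside <lst name="spellchecker">, point the suggester
     at that field instead of "text". -->
<str name="field">spell</str>
```

After a schema change like this, the documents need to be reindexed so that the new field is populated, and the suggester dictionary has to be rebuilt (here that happens on commit, because `buildOnCommit` is `true`).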