Tuesday, February 19, 2013

Solr - HTMLStripCharFilter...

I am attempting to store some data that I fetch from a website in Solr.  The data sometimes contains HTML markup, so I decided to use the HTMLStripCharFilterFactory in the field's analyzer.

Here is an example of the field type that I created:

<fieldType name="strippedHtml" class="solr.TextField">
   <analyzer>
      <charFilter class="solr.HTMLStripCharFilterFactory"/>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>      
      <filter class="solr.LowerCaseFilterFactory" />
   </analyzer>
</fieldType>


I used the strippedHtml field type for a field called itemDescription.  After indexing some data, a search showed that itemDescription still contained HTML markup.  I then ran some of the HTML data through the index-time analysis in the Analysis tab of the Solr admin UI, and none of the markup appeared to be stripped out there either.

It turns out that most of the HTML was encoded: the angle brackets had been replaced with their escaped entities (&lt; and &gt;), so the char filter saw no actual tags to strip.  I will need to find a way to remove the escaped markup as well.
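
One possible approach, which I have not tested against this data, is to put a PatternReplaceCharFilterFactory in front of the HTML stripper so that the escaped brackets are turned back into real ones before HTMLStripCharFilterFactory runs.  The field type name strippedEscapedHtml is made up for this sketch, and it assumes the markup is escaped exactly once, using only the &lt; and &gt; entities:

```xml
<fieldType name="strippedEscapedHtml" class="solr.TextField">
   <analyzer>
      <!-- Char filters run in the order listed: unescape the brackets first. -->
      <!-- In an XML attribute, &amp;lt; is the literal text &lt; and &lt; is a literal <. -->
      <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="&amp;lt;" replacement="&lt;"/>
      <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="&amp;gt;" replacement="&gt;"/>
      <!-- The tags are now real markup, so the stripper can remove them. -->
      <charFilter class="solr.HTMLStripCharFilterFactory"/>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
   </analyzer>
</fieldType>
```

Two caveats.  Text that was escaped on purpose (for example "a &lt; b") would also be treated as markup.  Also, char filters only change the tokens that get indexed, so the stored value returned in search results would still contain the original markup; cleaning that up would have to happen before the document is sent to Solr.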
