Apache Tika
Tika extracts and tokenizes text from 1400 file formats, such as .doc, .pdf, and .html.