Fulltext Search OpenProdoc


The Fulltext search allow to find document by its text besides it metadata or other criteria. This is possible by means of the Apache libraries Tika, that extract the text content from different file formats and Lucene, that analizes and index all the words of the text. The index are stored in a disk folder (for what it must be created an special repository) and are updated when documents are inserted, updated or deletd (for what it must be created some indexing tasks).

For using the fulltext searching a repository must be created (Repositories Maintenance) with the reserved name "PD_FTRep", of type filesystem (FS) and with an URL referencing a folder/fiilesystem that can be accesed by all the systems that index or search fulltext. (Ej."/prodoc/Ft_Index/"). No additional parameters are needed. It's important to note that the user(s) that do the indexing must have full permissions on the folder where the index are stored. Sometimes problems in indexing are solved assigning the permissions 7777 to the folder. If the installation uses only Web Client or Remote Conection, the J2EE server where the OpenProdoc application is installed must have read/wirte access to the Fulltext folders/filesystem.

Screenshot Repository FT Web

Screenshot Repository FT

Additionally, it must be created tasks linked to events (Tasks Event maintenance) for updating the fulltext index. It's necessary to create a task linked to of each kind of event (insert, Delete, Update) so each time a document is inserted, updated or deleted, a task for updating the Full Text indexes is created.

Screenshot Lista tareas FT Web

Screenshot Lista tareas FT

The easiest way, that will index all the documents in the complete repository, is to select as document type of the event the OpenProdoc base type "PD_DOCS" and as folder for filtering the root folder "/". That way the event will trigger for the documents of type PD_DOCS and all its subtypes (that is ALL types of documents) and the filter will for documents stored under folder root "/" and all its subfolders (that is ALL Folders in the repository).

If, by volumen or performance limitations, it's needed to limit the FT indexing to a set of document types, you can define a task the the document type common ancestor of all the document types selected o to define several packs of tasks (insert, update, delete), one for each document type to index. In a similar way, it's possible to filter the folders structure, so that the system index the structure "/Marketing" and all its subfolders and not the structure "/Private Information". As with the document types, you can select the common parent folder or to create several packs of 3 tasks (one for each folder).

The information to include in each task is:

The scheduled task will run depending on the defined frecuence and the list of pending task, so generally the documents will be not accesible fr full text search just after being inserted or updated..

You can search from the standard document search form (Search docs). The traditional search criteria by document types and metadata can be combined with full text expresions. The available operators are:

Indexing optimization selecting language and stopwords

In the full-text search, since version 2.3 of OpenProdoc, some improvements have been introduced, such as the possibility of choosing the language or being able to define a dictionary of stopwords, two measures that improve the quality of the searches results as well as performance.

Choosing a language activates the stemming for that language, that is the conversion from the words to their "root" before indexing. In this way, when searching, it is indifferent to enter "Document" or "Documents". Logically the stemming rules are different by language (for example in English the suffix "ing" of the gerunds will be eliminated). Therefore, the appropriate language must be chosen to match the language of documents to be searched. If the documents can be of several languages, it is possible to keep the language unspecified. It is generally not convenient to use a different language than the language of the documents, since the application of rules designed for another language may cause the quality of the results to decrease instead of increasing.

Regarding the dictionary of stopwords, includes words that are not significant for a search, either because they are "meaningless" particles (articles, prepositions, pronouns ...) or because they will appear in almost all documents (for example the word "ecology" in documentation of an environmental organization) and therefore the Search for those terms will return almost all documents, which does not add any value. The inclusion of words in the dictionary of empty words, on the one hand saves space in the files of search indexes by full text and provides more speed of search and indexing, and on the other hand facilitates the search, since their appearances are ignored and it focuses on the significant terms. For example, you can find documents where "pollutant discharges into the river" appear and "some pollutant has been spilled on the right bank of the river" if they are empty words: "the, in, the, has, the, margin, right ", since the terms associated with the document will be: dumping, contaminant, river (where stemming has also been used to remove plurals).

To choose the language or the list of empty words, the following procedure should be followed:

  1. Create a text file (.TXT) where all the empty words to be used are entered, each of them in a line. Lines that start with the # character will be ignored and can be used to include comments and explanations.
  2. The file must be incorporated into OpenProdoc, assigning it any type of document. After the insertion, the unique identifier of the document (PDId) must be saved. It is recommended to insert it in the "System" folder, although it is not essential.
  3. Subsequently, the full text repository (which has the name "PD_FTRep") must be modified in the list of repositories. The system detects that it is a full text repository and presents a "W" button that when pressed shows a form that allows editing the language (between the supporters by OpenProdoc: ES, EN, PT, CT) and enter the Identifier (PDid) of the document with the list of stop words.
  4. After saving the modifications, the program will use the new parameters for the following indexing and searching processes. The new configuration may take some time to be used (because the information is cached to increase performance). To force the use of the new configuration, it is best to restart the server.
  5. You can choose not to inform the language (by choosing the value "*") or not to enter a dictionary of empty words (leaving the identifier empty).
  6. The document with the stop words can be versioned, like any other OpenProdoc document. The program will use the latest version, although it may take some time to update it (because the information is cached to increase performance). To force the use of the new version, it is best to restart the server.

Help Index OpenProdoc