TextSearch searches a PDF document for a user-supplied search pattern.
The current implementation supports both verbatim search and search
using regular expressions. The regular expression syntax is documented at:
http://www.boost.org/doc/libs/release/libs/regex/doc/html/boost_regex/syntax/perl_syntax.html
TextSearch also provides several search modes and returns extra
information in addition to the string that matches the pattern. TextSearch
can either keep running until a match is found or be set to return
periodically so that the caller can perform any necessary updates
(e.g., UI updates). The search modes can also be
changed on the fly while searching through a document.
Possible use cases for TextSearch include:
- Guiding users of a PDF viewer (e.g., one implemented with PDFViewCtrl) to the places
they are interested in;
- Finding PDF documents of interest that contain certain patterns;
- Extracting information of interest (e.g., credit card numbers) from a set of files;
- Extracting Highlight information (refer to the Highlights class for details) from
files for external use.
Note: Hyphens ('-') are frequently used in PDF documents to join the two
pieces of a word that is broken at the end of a line. For example, in
"TextSearch is powerful for finding patterns in PDF files; yes, it is really pow-
erful."
a search for "powerful" should return both instances. However, not every end-of-line
hyphen was added to connect a broken word; some are "real"
hyphens. In addition, the search pattern itself may contain hyphens, which complicates
the situation. To handle this, the following conventions are adopted:
- In verbatim search mode, when the pattern contains no hyphen, a string matches
if it is exactly the same as the pattern or differs from it only by end-of-line
or start-of-line hyphens. For example, as mentioned above, a search for "powerful"
returns both instances.
- In verbatim search mode, when the pattern contains one or more hyphens, a
string matches only if it matches the pattern exactly. For
example, a search for "pow-erful" returns only the second instance, and a search
for "power-ful" returns nothing.
- When searching with regular expressions, hyphens are not handled implicitly.
Users must account for them in the pattern. For example, to find both
"powerful" instances, the input pattern can be "pow-{0,1}erful".
Examples
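The regular-expression convention for hyphens can be checked without a PDF at hand. The following is a minimal sketch, not part of the TextSearch API: it uses .NET's System.Text.RegularExpressions as a stand-in for the Boost Perl syntax that TextSearch uses, and it assumes the sample sentence is flattened to one line with the line break after the end-of-line hyphen dropped. The class name HyphenPatternSketch and its members are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HyphenPatternSketch
{
    // Sample sentence from the hyphen note, flattened to a single line.
    // Assumption: the line break after "pow-" is dropped, leaving "pow-erful".
    public const string Sample =
        "TextSearch is powerful for finding patterns in PDF files; " +
        "yes, it is really pow-erful.";

    // Counts the non-overlapping matches of a regular expression in the text.
    public static int CountMatches(string text, string pattern)
    {
        return Regex.Matches(text, pattern).Count;
    }
}
```

With this stand-in, CountMatches(Sample, "pow-{0,1}erful") finds both instances, while the plain pattern "powerful" finds only the first and "power-ful" finds none. This mirrors the rule that a regular expression must spell out the optional hyphen itself.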
For a full sample, please take a look at the TextSearch sample project.
PDFDoc doc = new PDFDoc(filein);
doc.InitSecurityHandler();
// Combine search modes with bitwise OR.
int mode = TextSearch.e_whole_word | TextSearch.e_page_stop;
string pattern = "joHn sMiTh";
TextSearch txt_search = new TextSearch();
doc.Lock();
txt_search.Begin( doc, pattern, mode, -1, -1 );
while ( true )
{
    // Each call to Run() returns at the next match, page end, or document end.
    TextSearch.ResultCode result = txt_search.Run();
    if ( result.GetCode() == TextSearchResult.e_found )
    {
        Console.WriteLine("found one instance: " + result.GetResultStr());
    }
    else
    {
        // Any other code ends this loop, including the end-of-page
        // return enabled by e_page_stop.
        break;
    }
}
doc.UnLock();
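The loop above stops at the first result that is not a match. When page stops are enabled, a caller that wants to cover the whole document keeps looping on the page-boundary result and stops only at the end of the document. The sketch below illustrates that control flow with hypothetical stand-in types (StepCode, PagedSearchSketch) over plain strings. None of these names belong to pdftron.PDF, and the case-insensitive matching is an assumption of the sketch.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-ins for illustration only; not part of pdftron.PDF.
public enum StepCode { Done, Page, Found }

public sealed class PagedSearchSketch
{
    private readonly IReadOnlyList<string> pages;
    private readonly string pattern;
    private int pageIndex; // zero-based index of the page being searched
    private int offset;    // position in that page where the next step resumes

    public PagedSearchSketch(IReadOnlyList<string> pages, string pattern)
    {
        if (string.IsNullOrEmpty(pattern))
            throw new ArgumentException("pattern must be non-empty", nameof(pattern));
        this.pages = pages;
        this.pattern = pattern;
    }

    // Returns at the next match, at the end of the current page,
    // or when the document is exhausted.
    public StepCode Step()
    {
        if (pageIndex >= pages.Count) return StepCode.Done;

        int hit = pages[pageIndex].IndexOf(pattern, offset, StringComparison.OrdinalIgnoreCase);
        if (hit >= 0)
        {
            offset = hit + pattern.Length;
            return StepCode.Found;
        }

        // No further match on this page: advance and report the page boundary.
        pageIndex++;
        offset = 0;
        return StepCode.Page;
    }

    // Drives Step() in a loop, counting matches and page stops.
    public static (int Found, int PageStops) CountAll(IReadOnlyList<string> pages, string pattern)
    {
        var search = new PagedSearchSketch(pages, pattern);
        int found = 0, pageStops = 0;
        while (true)
        {
            StepCode code = search.Step();
            if (code == StepCode.Found) found++;
            else if (code == StepCode.Page) pageStops++; // a caller could refresh its UI here
            else break;                                  // StepCode.Done
        }
        return (found, pageStops);
    }
}
```

Treating the page boundary as a "continue" signal rather than a terminal one lets the caller do periodic work between pages without losing matches on later pages.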
Inheritance Hierarchy
Namespace:
pdftron.PDF
Assembly:
pdftron (in pdftron.dll) Version: 255.255.255.255
Syntax
C#:
public sealed class TextSearch : IClosable
Visual Basic:
Public NotInheritable Class TextSearch
    Implements IClosable
Visual C++:
public ref class TextSearch sealed : IClosable
JavaScript:
pdftron.PDF.TextSearch = function();
Type.createClass(
    'pdftron.PDF.TextSearch',
    null,
    Windows.Foundation.IClosable);
The TextSearch type exposes the following members.
Constructors
Methods
| Name | Description |
|---|---|
 | Begin | Initializes the search process. This should be called before starting the actual search with the Run() method. |
 | Close | |
 | Equals | Determines whether the specified Object is equal to the current Object. (Inherited from Object.) |
 | GetCurrentPage | Retrieves the number of the page currently being searched.
A return value of -1 indicates that the search process has not been initialized
(e.g., Begin() has not been called yet); a return value of 0 indicates that the search
process has finished; a positive return value is a valid page number.
|
 | GetHashCode | Serves as a hash function for a particular type. (Inherited from Object.) |
 | GetMode | Retrieve the current search mode. |
 | GetType | Gets the Type of the current instance. (Inherited from Object.) |
 | Run | Searches the document and returns when one of the following occurs:
- The end of the document is reached
- The end of a page is reached (if the 'e_page_stop' mode is set)
- An instance matching the search pattern is found
Remarks
Note that this method should be called in a loop in order to find all matching instances;
in other words, the search is conducted incrementally.
|
 | SetMode | Sets the current search mode. For example, the following code turns on regular
expression search:
TextSearch ts = new TextSearch();
...
int mode = ts.GetMode();
mode |= TextSearch.e_reg_expression;
ts.SetMode(mode);
...
|
 | SetPattern | Sets the current search pattern. Calling this method is not required, since
the search pattern is already set by the call to Begin(). It is provided
so that users can change the search pattern while searching through a document.
|
 | SetRightToLeftLanguage |
Tells TextSearch that the document reads from right to left.
|
 | ToString | Returns a string that represents the current object. (Inherited from Object.) |
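The mode handling and current-page values described in the table can be modeled in a few lines. This is a minimal sketch under stated assumptions: the class SearchStateSketch, its helper names, and the bit values are hypothetical and chosen only for illustration. The real constants and their values come from the library.

```csharp
using System;

public static class SearchStateSketch
{
    // Hypothetical bit values for illustration; the real constants may differ.
    public const int RegExpression = 0x0001;
    public const int WholeWord = 0x0004;
    public const int PageStop = 0x0010;

    // Turns a mode flag on without disturbing the others.
    public static int Enable(int mode, int flag) => mode | flag;

    // Turns a mode flag off without disturbing the others.
    public static int Disable(int mode, int flag) => mode & ~flag;

    // Reports whether every bit of the flag is set in the mode.
    public static bool IsEnabled(int mode, int flag) => (mode & flag) == flag;

    // Interprets a current-page value as described for GetCurrentPage:
    // -1 means not initialized, 0 means finished, positive is a page number.
    public static string DescribeCurrentPage(int page)
    {
        if (page == -1) return "not initialized";
        if (page == 0) return "finished";
        if (page > 0) return "searching page " + page;
        throw new ArgumentOutOfRangeException(nameof(page));
    }
}
```

Because each mode occupies its own bit, a flag can be switched on or off mid-search by reading the current mode, adjusting one bit, and writing the mode back, which is the pattern the SetMode example follows.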