Lesson 3: Accessing Pages
| filename: | lesson03-access_pages.py |
| getName() | Lesson 03 - Page Access |
| getDescription() | Lesson 03 - Read the contents of pages within a book. |
Now it's starting to get interesting - we are about to start reading the contents of a page. Reading the page opens up a lot
of new possibilities for your plugins. Here are a few examples.
A Note On Using Functions
In this lesson, we're going to do a few different tasks. Rather than put all tasks into the execute function, which might make it too big, we'll break the code up into functions, and simply call them from the execute function. You can do this with your own plugins - it is often necessary once you start writing more complex plugins.
Here is an example of calling a function. In the 'execute' function, which we know does the actual work, we simply put this code:
outputFile = open( "C:/lesson03.txt", "w" )
outputFile.write( "Lesson 03\n\n" )
doLesson3( outputFile )
outputFile.close()
Now we are moving all of the actual work to a new function, 'doLesson3'. That function looks like this:
def doLesson3( outputFile ):
    numBooks = theAppData.library.getNumBooks()
    for i in range( numBooks ):
        book = theAppData.library.getBook( i )
        writeBookHeader( outputFile, book )
        writePageSummary( outputFile, book )
The only thing to note is that you have to pass parameters (such as the outputFile) around, to make sure that they are available
to the other functions. An alternative would be to have your plugin define global variables, but we tend to avoid that.
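To make the parameter passing concrete, here is a minimal sketch that runs outside Note Studio. It assumes a hypothetical helper, 'write_book_header', whose body is our own guess (the lesson does not show what 'writeBookHeader' contains), and it writes to an in-memory stream instead of a real file so that the result can be inspected.

```python
import io


def write_book_header(output_file, book_name):
    # Hypothetical helper: writes the book name and an underline.
    # The real writeBookHeader in the lesson plugin may look different.
    output_file.write(book_name + "\n")
    output_file.write("-" * len(book_name) + "\n")


def do_report(output_file, book_names):
    # The output stream is passed down explicitly rather than stored
    # in a global variable.
    for name in book_names:
        write_book_header(output_file, name)


# An in-memory stream stands in for the file opened in 'execute'.
buffer = io.StringIO()
do_report(buffer, ["Recipes", "Travel"])
report = buffer.getvalue()
```

Because every function receives the stream as a parameter, the same helpers work unchanged whether the caller hands them a real file or a test buffer.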
List Page Names
Last lesson, we briefly met the WikiBook class. Once we obtained a WikiBook object, we used a method called getDisplayName to get the name of that book. WikiBook has some other very useful methods, illustrated by the following lines from our lesson 3 plugin:
numPages = book.getNumPages()
for i in range( numPages ):
    pageName = book.getPageName( i )
Does that look familiar? It's actually very similar to how we obtained the book names in the previous lesson. First, we find
out how many objects exist; then we iterate over those objects.
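The count-then-iterate pattern can be tried outside Note Studio with a stand-in object. In this sketch, 'FakeBook' is a hypothetical class of our own that only mimics the 'getNumPages' and 'getPageName' methods named in this lesson; it is not the real WikiBook.

```python
class FakeBook:
    """Hypothetical stand-in for WikiBook, for experimenting offline."""

    def __init__(self, page_names):
        self._page_names = page_names

    def getNumPages(self):
        return len(self._page_names)

    def getPageName(self, index):
        return self._page_names[index]


def list_page_names(book):
    # First find out how many pages exist, then ask for each name by index.
    names = []
    for i in range(book.getNumPages()):
        names.append(book.getPageName(i))
    return names


book = FakeBook(["Characters", "Planets"])
names = list_page_names(book)
```

Inside a real plugin you would pass the object returned by 'theAppData.library.getBook( i )' instead of the stand-in.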
Read Page Contents
To read the contents of a page, you need to obtain or create a WikiPage object representing that page. Naturally, WikiBook has a method which does exactly that:
page = book.getPage( i )
'getPage' returns a WikiPage object. Once you have that, you can then query the page itself. We do this in the function 'searchPageForText':
rawText = page.getText()
'getText()' returns what we call raw text, which means: 'what you would write in Edit Mode'. Note that getText() can return a very large string, depending on the size of your page.
There is also a way to access the HTML which Note Studio displays in View Mode, but that is an advanced topic.
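As a rough illustration of reading raw text, the sketch below collects the name and raw-text length of every page in a book. 'FakePage' and 'FakeBook' are hypothetical stand-ins that mimic only 'getPage', 'getPageName', 'getNumPages' and 'getText'; the real Note Studio classes may behave differently in details this lesson does not cover.

```python
class FakePage:
    """Hypothetical stand-in for WikiPage."""

    def __init__(self, text):
        self._text = text

    def getText(self):
        # Raw text: what you would write in Edit Mode.
        return self._text


class FakeBook:
    """Hypothetical stand-in for WikiBook."""

    def __init__(self, pages):
        self._pages = pages  # list of (page name, raw text) pairs

    def getNumPages(self):
        return len(self._pages)

    def getPageName(self, index):
        return self._pages[index][0]

    def getPage(self, index):
        return FakePage(self._pages[index][1])


def summarize_pages(book):
    # Build a list of (page name, number of characters of raw text).
    summary = []
    for i in range(book.getNumPages()):
        raw_text = book.getPage(i).getText()
        summary.append((book.getPageName(i), len(raw_text)))
    return summary


book = FakeBook([
    ("Characters", "*Chewbacca* is a wookie."),
    ("Planets", "Kashyyyk"),
])
summary = summarize_pages(book)
```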
Parsing Page Contents
Now, if you're not too familiar with Python's powerful libraries, you might want to start writing your own code to handle
this page text. Our advice: be careful not to reinvent the wheel. Very often, you'll find that Python already has exactly
the sort of routines you're looking for - look for Python functions to do what you want.
To demonstrate, we'll look at a few ways we can handle page text, using standard Python library functions.
Search for Text on Page
There's no need to write any code for string matching. It's all in Python's 'string' module. Simply import that module at the top:
import string
Then use its built-in string-search routine:
if string.find( rawText, "wookie" ) != -1:
Python modules are documented in the Python documentation, available online at: http://docs.python.org.
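One caveat worth knowing: the module-level 'string.find' function belongs to older Python versions and was removed in Python 3, whereas the equivalent method on the string itself works in both. Which Python version your Note Studio installation embeds is something to check for yourself. A small sketch of the method form, using made-up sample text:

```python
raw_text = "Chewbacca is a wookie from Kashyyyk."

# str.find returns the index of the first match, or -1 if there is none.
position = raw_text.find("wookie")
missing = raw_text.find("ewok")

# When only a yes/no answer is needed, the 'in' operator reads more naturally.
found = "wookie" in raw_text
```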
If you're the type of person who is comfortable with regular expressions, then you can do much more flexible searches (case-insensitive matches, whole-word matches, alternative spellings) using Python's built-in 're' module.
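As a small taste of what 're' offers, this sketch finds 'wookie' regardless of capitalisation, accepts the alternative spelling 'wookiee', and allows a plural 's'. The pattern and the sample text are our own illustration, not part of the lesson plugin.

```python
import re

# \b marks a word boundary; 'e?' and 's?' make the extra 'e' and the
# plural 's' optional; IGNORECASE matches 'Wookiee' as well as 'wookie'.
WOOKIE_PATTERN = re.compile(r"\bwookiee?s?\b", re.IGNORECASE)

raw_text = "Chewbacca is a Wookiee. Two wookies walked in."

matches = WOOKIE_PATTERN.findall(raw_text)
has_match = WOOKIE_PATTERN.search(raw_text) is not None
```

A plain substring search for "wookie" would miss the capitalised 'Wookiee' here, which is the kind of case where a regular expression earns its keep.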
Accessing Page Contents Line-By-Line
Python has a StringIO module, which allows you to handle a string as though it were a file stream. This is good news, because
file objects have a number of built-in methods which can be useful. For example, a file object has a 'readlines' method, which returns its contents as a list of lines, so you can access each line individually. No need for your own line-breaking
routine - let Python do it. Let's count how many lines on each page contain the word 'wookie'.
First, you need to import the relevant Python module, StringIO. By convention, we do it at the top of the file:
import StringIO
Then, we can use the module later on:
stringStream = StringIO.StringIO( rawText )
pageLines = stringStream.readlines()
count = 0
for line in pageLines:
    if string.find( line, "wookie" ) != -1:
        count = count + 1
outputFile.write( "found it on %d different lines\n"%(count) )
Here the main interest is in the first two lines. First, we construct the 'StringIO' object by just passing it the page text. Then we call 'readlines()', which returns a list of individual lines from the page. Note that 'readlines', a built-in method of the StringIO object, does all the work. You can then iterate through the resulting list, to access each individual line.
In this tutorial, we simply do a string search on each line. But you could do any line-by-line operation with this approach.
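For example, instead of only counting, you might record which lines matched. The sketch below does that with made-up page text. It is written for Python 3, where the class lives in the 'io' module as 'io.StringIO' rather than in a separate 'StringIO' module; the function name 'find_matching_lines' is our own.

```python
import io


def find_matching_lines(raw_text, word):
    # Treat the page text as a file stream and let readlines() split it.
    stream = io.StringIO(raw_text)
    matching = []
    # enumerate(..., start=1) gives human-friendly line numbers.
    for line_number, line in enumerate(stream.readlines(), start=1):
        if word in line:
            matching.append(line_number)
    return matching


raw_text = (
    "Han flies the ship.\n"
    "The wookie growls.\n"
    "A wookie never forgets.\n"
)
matching_lines = find_matching_lines(raw_text, "wookie")
count = len(matching_lines)
```

The length of the returned list gives the same count as the lesson's loop, and the line numbers come along for free.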
Search for Note Studio markup
If, for some reason, you want to know something about the markup on a particular page, then Note Studio exposes some functions
which can help you. To give a pretty weak example, if you wanted to find whether a particular page contained bold text, please
don't do your own parsing, searching for asterisk characters, etc. Use the intermediate formatted text as described in the advanced tutorial Lesson A2: Raw, Formatted, and Styled Text.
Conclusion
Now you can find individual pages, and examine their contents. Although we're only up to lesson 3, this is already extremely
powerful. You have enough information to start trawling through books, producing reports, looking for specific things - all
sorts of things!
Next, we'll look at how you can modify books and pages, to automatically produce your own Note Studio content.