************************************************************************ * Orca Search v2.3 * * A robust auto-spidering search engine for single/multiple sites * * Copyright (C) 2006 GreyWyvern * * * * This program is free software; you can redistribute it and/or modify * * it under the terms of the GNU General Public License as published by * * the Free Software Foundation; either version 2 of the License, or * * (at your option) any later version. * * * * This program is distributed in the hope that it will be useful, * * but WITHOUT ANY WARRANTY; without even the implied warranty of * * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * * GNU General Public License for more details. * * * * You should have received a copy of the GNU General Public License * * along with this program; if not, write to the Free Software * * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 * * USA * ************************************************************************ *** Changelog - See "changelog.txt" for this script's complete revision history *** Quick Start - See "quickstart.txt" for short installation instructions suitable for advanced users. *** Upgrading - See "upgrade.txt" for step-by-step instructions to guide you in successfully upgrading from a previous version *** FAQ - See "faq.txt" for this script's FAQ *** Contents 1. Script Requirements 2. Introduction 3. Installation i. Ensure you have all files ii. Edit config.ini.php iii. Upload the files iv. Activate the Control Panel 4. Spider Configuration i. Logging in and setting up ii. Starting URIs iii. Choosing a Trigger iv. Automatic Categorisation v. URI Matching vi. Remove Title Strings vii. Remove Elements viii. Starting the engine ix. Additional filetype plugins 5. Entry List Panel i. Filtering and Sorting ii. Status types iii. New entries iv. The Action dropdown menu v. Data Locking 6. Searching i. Output format ii. Customisation iii. Standalone search boxes 7. Search Options i. The Basics ii. URI Matching iii. Latin Accent Matching iv. Adjusting Match Relevance v. Miscellaneous 8. Crontab Spidering 9. Sitemap Extension 10. JWriter Extension ************************************************************************ ************************************************************************ 1. Script Requirements PHP 4.2.0+ (4.3.0+ recommended) MySQL 3.23+ ************************************************************************ ************************************************************************ 2. Introduction Welcome to the Orca Search script. This script will index the pages at a single domain, or group of specified domains by spidering the contents at an interval you specify. You may even set up the spider to be run at a specific time via *NIX cron tab. Note: The Orca Search is a full-text search, rather than an index-word search; this provides a few key benefits such as: complete language independence, fast spidering and highly accurate searching. However, as a result of this design, the script will only usefully scale to sites with 1,000 ~ 2,000 pages. Decent spider crawl and search return times have been reported for indexes with up to 10,000 pages, but such results are not typical. What follows are detailed instructions to help you make full use of this script. If you are an advanced user and don't need to hear a rehash of uploading, CHMODing and editing text files, you may use the Quick Start guides in the quickstart.txt file. It contains short, step- by-step instructions describing how to set up the script itself, and the JWriter and Sitemap extensions. For everyone else, I recommend reading this entire readme.txt file to make sure you don't miss out on features you might have just skimmed over. It's long, but it's worth it. Besides, I spent so much time writing it! :) Don't let me down, eh? Are you ready? Once the script is up and running, you'll need to teach your spider what to eat and what to leave alone. This will take some tweaking for days or weeks to come, but eventually, what you'll end up with is an automatic self-updating search system you never have to think about again! Well, maybe not "never", but pretty close :) ***** Please report any bugs to: wyvern@greywyvern.com ***** ************************************************************************ ************************************************************************ 3. Installation i. Ensure you have all files The Orca Search v2.0 attempts to be completely modular, unlike the other scripts in the Orca series. Most pieces of the script are 100% swappable with newer or modified versions, should they become available. Each file has a specific function which can be built upon by future modders. As such, there are three types of files for this script: Core, Output and Tools. So let's run through all the files you got in your Orca Search package: There are seven (7) Core files. These files MUST be installed in order for the basic search script to work: config.ini.php - User variables file config.php - Global configuration file control.css - CSS for Control Panel control.php - The Control Panel head.php - Accepts search requests and builds an array of results lang.txt - Language file for spider and control panel spider.php - Crawls your site and indexes pages The default language file is English. To use a different language file, if available, name the file "lang.txt" and overwrite the existing text file. You may need to adjust the Control Panel Display Charset as per instructions in the language file you are now using (explained later). The control panel script will always look for a file named "lang.txt" in the same directory in which the control.php file resides. There are four (4) Output files. Three of them, "body.xhtml.php", "body.xhtml.css" and "body.xhtml.lang.txt" make a set, while "body.rss20.php" stands alone. Only one set of these needs to be installed depending on what type of output you want your search script to generate: body.xhtml.php -\ body.xhtml.css --}- Generates XHTML output from a result array body.xhtml.lang.txt -/ body.rss20.php - Generates an RSS 2.0 feed of search results There are four (4) Tools files. The first three files are for a tool called the JWriter. This tool will take the data from your search database and compress it into a javascript file which can be used to search a site which has been mirrored for offline use by a program such as HTTrack. I've designed the JWriter mainly using output from this mirroring program so I highly recommend it: The JWriter tool uses these files: egg.js - Offline Javascript target file jwriter.php - JWriter workhorse _search.html - Sample offline search page The fourth file is the target file for Sitemap writing: sitemap.xml - Empty Sitemap file Google is currently beta testing a sitemap service, where you can submit an XML or gzipped XML list of pages at the site you want indexed. The script will output this XML data into the sitemap.xml file. There is also the option to gzip this data to save space and bandwidth. If you select this option, you will need to rename this file with an .xml.gz extension. Among all of these files there is a small file named "_search.php". This file shows a sample XHTML search result page (using the "body.xhtml.php" output file) and where each piece of the search engine is included. Following the same system of PHP includes, you should be able to embed the search engine into your already existing webpage layout. The .zip file also includes an .htaccess file which turns off your server's zlib output compression within the os2/ directory. The spider's progressive output requires delivery of the page to the browser in plain HTML. You don't need to upload this file if you are having no problems viewing the spider's incremental output, when triggered from the Control Panel. Servers usually have zlib compression turned off by default. ii. Edit config.ini.php Before uploading, open the "config.ini.php" file. There is a short list of variables you need to assign manually in order for the script to work on your server. First, there are five variables under the MySQL header. These variables allow the script to access the MySQL database system on your server to manage script data and store search indices. If you don't know these variables offhand, ask your host. The first four variables will be specific to your server and MySQL installation. However, the fifth variable will be used as a prefix for creating three tables in your database. You can give this variable any name you want, as long as it is only letters and numbers and does not begin with a number. Next there are two variables under the Admin header. These will be the login name and password for the Control Panel. Change them to something hard to guess! Once everything is the way you like it, you can save and close the "config.ini.php" file. iii. Upload the files Create a directory in your public HTTP area to contain the search script files. The default is "os2" but you can specify any directory you want, provided you change the include statements in any search page you create, and in appropriate places in the Control Panel, to point to the new directory. ******************************** NOTE ******************************** * For the purpose of convenience, the remainder of this manual will * * assume you are using the default "os2" directory and the XHTML * * Output files! * ********************************************************************** Upload the following Core files into the "os2" directory: config.ini.php config.php control.css control.php head.php lang.txt spider.php Then upload the following Output files into the "os2" directory: body.xhtml.php body.xhtml.lang.txt body.xhtml.css Finally, upload the following file into the parent directory of "os2": _search.php After you have uploaded all the files, your directory structure should look like this: /_search.php /os2/body.xhtml.php /os2/body.xhtml.lang.txt /os2/body.xhtml.css /os2/config.ini.php /os2/config.php /os2/control.css /os2/control.php /os2/head.php /os2/lang.txt /os2/spider.php iv. Activate the Control Panel When the above files have been uploaded, visit "os2/control.php" via HTTP with your web browser. If you are prompted with a login screen, script setup was a success! The Control Panel is now installed and ready to be configured. ************************************************************************ ************************************************************************ 4. Spider Configuration i. Logging in and setting up Log in using the username and password you entered in the "config.ini.php" file. Once you log in, you'll be confronted by the Spider setup area. Scroll down and set all your spidering options, they should be well explained by the help text included in the form. Some of the more obscure form elements will be explained below. Make sure you "Submit" to save your changes! If you have trouble entering or viewing special characters in any of the fields, the problem may be that the Control Panel is not being served in the character set of your input. If this is the case, click the Tools button in the menu and change the Display Charset to your preference. If you load a different language file than the default (English), make sure to check which character set(s) it is compatible with. The Control Panel Display Charset *must* be changed to match the language file or else dialogues may not display properly. ii. Starting URIs Choosing a good starting URI (you may also specify multiple starting URIs) is important. You want to choose a URI which contains links to as many other pages on your site as possible. Usually this is the home page or some sort of sitemap page. However, note that the spider will not travel up in the directory tree. So if you start your spider in a deeper directory, links to pages in directories above it will be discarded. See these examples: http://www.example.com/ - The spider is free to use links to any directory at example.com http://www.example.com/~user/ - The spider can only use links which stay in the /~user/ directory http://www.example.com/~user/ http://www.example.com/ - The spider will use both locations as starting URIs and because of the second URI, the spider can use links to any directory Keep in mind that links which are found within any domain will only be followed if that domain is within your Allowed Domains list, further down the Spider Panel. iii. Choosing a Trigger You can choose one of two ways to trigger your spider. By default, the Orca Search uses an internal interval triggered by use of the search interface. In simple terms, the script keeps track of the last time it spidered; when someone uses your search engine, the script checks to see if a specified amount of time has passed; if so, a spider is triggered as the search finishes. While this is a simple means for causing a recurring spider, it also introduces a time-creep into the schedule. Since the trigger is determined by search interface use, which could happen at any time, the effective interval will always be some value slightly above the interval you specify. If you require your spider to run *only* when you want it to - perhaps to quarantine it to a time of day when server load is the lowest - you can use the Crontab Trigger option. See section 8 of this readme file for more information on this topic. Starting in version 2.2, the Orca Search has an option called Seamless Spidering, which is on by default. Formerly, when a spider was triggered (either by internal interval or crontab), the search index would lock and the search interface could not be used until the crawl was completed. This was an especially serious problem if your crawls took a long time. With Seamless Spidering enabled, the spider will make a copy of the index table to work with while the original index table remains open to searching. When the crawl is completed successfully, the original index table is overwritten by the updated table. Because a copy of the index table is used, you must have enough MySQL storage space to hold TWO complete indexes simultaneously. Unless your index is extremely large, this shouldn't be a problem, but contact your host if you are at all concerned, or if spiders don't seem to finish properly if Seamless Spidering is enabled. iv. Automatic Categorisation There is a special textarea in the spider options labelled "Automatic Categorisation". You can use this field to make the spider automatically assign certain categories to newly found pages. The field uses this special syntax: CategoryName:::URIMatch or CategoryName;;;TitleMatch Type the name of the category you want automatically assigned first, then choose whether a URI or Title string match will work best. If URI, add three colons, if Title, add three semi-colons. After the colons/ semi-colons, type a plain text matching string which will trigger the assignment of this category. These matches will be compared to the entire title or entire URI (including the "http://"). Here are some examples: a) Products:::products/ This rule will match these example URIs: http://www.example.com/products/ http://www.example.com/products/item1.php http://www.example.com/donotindex/products/item1.php If any of these URIs are found, they will automatically be assigned to the "Products" category. b) My Blog;;;My Blog This rule will match these example Titles: My Blog My Blog - A Day At The Farm Add a Comment to My Blog If any of these Titles are found, they will automatically be assigned to the "My Blog" category. Because of the three colons and semi-colons system being used, you cannot assign category names which contain these character sequences. However, the match string may contain them, if needed. The results of the spider can also be emailed to you. The Email Results field accepts email addresses in the same format as PHP's mail() function. Read more about it here: http://php.net/manual/en/function.mail.php v. URI Matching A good way to limit where the spider goes is by using the URI Matches textareas. By limiting what URIs the spider can request, you'll save valuable CPU cycles and data transfer during each spider. The first textarea is labelled "Require URI Matches" and is the more powerful of the two. Any scoured URI found by the spider which does not match at least one of the lines you enter here will be ignored. For instance, you can limit the spider to pages named "blog.php" with this rule: /blog.php Only URIs which contain that text will be requested and indexed. Think of this list as a "strict whitelist". The second textarea can be considered a "blacklist" and is called "Ignore URI Matches". Any URI the spider finds which matches at least one of the lines given here will be ignored without even requesting it from any server. So, say you have both index.html and index.php pages on your site, but since one redirects to the other, you don't want the spider to request the same page twice. Just add this line here: /index.html It is important to note that these matching lines, from both sections, are compared against the *entire* URI, including the domain name and even the "http://". In the case above, the line would also match these URIs: http://www.example.com/index.html http://www.example.com/directory1/index.html http://index.html.com/default.htm Mistakes like the third URI are possible without some double checking, so make sure all matches you input provide the least amount of possible error. vi. Remove Title Strings On some dynamic sites, each page is generated with a standard title prefix or suffix. For example, your pages might all have titles like: "My Company Inc. - [page title]" After a straight index, when a user does a search using the interface, all the search results will begin with: "1. My Company Inc. - Widgets" "2. My Company Inc. - Cleaning Your Widget" "3. My Company Inc. - Widgets For All" Obvously this redundancy doesn't make any sense, as well as looking pretty unhelpful. Using the Remove Title Strings textarea, you can specify strings of plain text which will be stripped from each title, leaving your search result titles short and to-the-point. So if you insert the string "My Company Inc. - " as a line in this textarea, after spidering your search result titles will now look like: "1. Widgets" "2. Cleaning Your Widget" "3. Widgets For All" vii. Remove Elements Sometimes you will have pages which you would like to index, but you'd also like to exclude some content on these pages. The content may be redundant, better explained elsewhere or simply not useful as a search result. What is needed is a way to remove certain HTML elements along with all their contents before the page is indexed. You can do this with the Remove Elements text area. Rather than implement a proprietary exclusion method, the Orca Search uses a small subset of the tried-and-true CSS selector model. There are five types of element you can exclude by adding entries here, each entry separated by spaces. a) element This is the basic, plain exclusion rule. All elements named will be removed from the source along with their contents, before the page data is indexed. Some common elements are included in the Remove Elements text area by default. b) element#id This is an element specific exclusion rule and works just like the corresponding CSS selector. The element named , which also has its id attribute set to "id", will be removed from the source along with its contents. c) element.class The same as b) except this rule matches elements with class attributes of "class". Multiple elements can have the same class and single elements can have more than one class (eg. class="class1 class2"). The rule above works in both situations. d) #id This is a non-specific exclusion rule. An element of any name which has an id of "id" will be removed from the source along with its contents. e) .class The same as d) except matching a class attribute rather than id. Once again, this only matches single classes. An example rule of this type (".noindex") is included in the Remove Elements text area by default. viii. Starting the engine After the spider has been found and you have set all your desired Spider options, hit the "Go" button in the form up top to begin the spider. Then watch it crawl! Because crawling a site requires a lot of error tolerance, if anything goes wrong with this search script, it will probably happen now. If an error does happen, the spider will stop and a message will be displayed. As I aim to make this script work with as many different PHP installations and URI formats as possible, if you could email me any error messages I would be very grateful :) As we move on, I will assume the spider completed its crawl sucessfully. Unless you were really meticulous, you'll probably notice that the spider ate a lot more or a lot less than you were expecting it to. This is normal. Just look through the list of files the spider ate, adjust your rules on the Spider page, and try again. You don't have to get it perfect just yet though, since you can make manual edits using the Entry List section. Remember, that if the spider crawled and indexed some pages it wasn't supposed to, those pages will not be purged if you add a blocking rule and spider again. They will only be marked as "Blocked" or "Unread". You will need to manually delete them from the Entry List panel. ix. Additional filetype plugins Spider plugins are ways you can extend the function of the spider by adding new file types and making a few simple changes to the script. These plugins are usually small php files which handle the spider output for certain MIME-types. These php files should be placed in the os2/plugins/ directory; create this directory if you haven't already. When indexing new file types, often the means for extracting the text requires an executable on your server to run against an actual file. Because data downloaded from the internet isn't really a file yet, the script needs a temporary directory to which it can upload files. In your os2/ directory (or wherever else you put the script) create a child directory called "temp". The Orca Search package comes with this empty folder by default, so you may have uploaded it already. See each individual plugin's help text file for their various installation instructions. To finalize each installation, you will need to include the file in your config.ini.php file. An example plugin include line looks like this: include "plugins/index.pdf.php"; ************************************************************************ ************************************************************************ 5. Entry List Panel Click the "Entry List" button in the menu to go to the Entry List. Here you'll find a big list of every page your spider has indexed. You can go through and make any edits you want, like adding custom keywords, titles and descriptions to entries, changing their category and even manually unlisting and/or deleting them. By default, the Entry List lists 100 entries per page. To change this, use the text field above the column containing the "Edit" buttons. Your selection will be remembered until you change it, even if you log out. Get used to using this interface, as it will become your main interface for managing the pages crawled by your spider. If you have any suggestions for making it easier to use, or for new features, be sure to let me know :) i. Filtering and Sorting It's important to note right away, that even if you have many thousands of pages listed, you can use the various filters in the Filters row to narrow down your list. Make it a habit to experiment with the different filters and using them together for very powerful matching. The Entry List has four main columns: Title/URI; Category; Status; and Edit/Sitemap. You can sort the list based on Title, URI and Category; just use the links along the top. Initially, you will see each entry's URI, Category and Status listed. ii. Status types Each entry falls into one of five basic status types: OK - Page was successfully found during a normal spider, or was indexed successfully via respidering. Orphan - Page was not linked to from any other page within your specified Allowed Domains or the Spider depth was not deep enough to reach it. Added - Page was recently added manually and will not appear in search results until it is spidered Blocked - Page was at least one of: - blocked via robots.txt - blocked by a user-defined Ignore URI rule - socket error while requesting - URI was HTTP redirected elsewhere - contained a tag redirect - unnacceptable MIME-type Not Found - This page used to be indexed but can no longer be found; no forwarding address was given. In addition to these status types, "OK" and "Orphan" pages can be either "Indexed" or "Unread" (all "Added", "Blocked" and "Not Found" pages are Unread by definition). When a page is "Indexed", it contains searchable body text and other meta information like a title, keywords and a description. Conversely, when a page is "Unread" it either contained no indexable text, or was not an indexable MIME-type. OK and Orphan pages which are Indexed will have their status listed in a normal font, while Unread pages will appear in strike-through. Finally, superceding all of these is the "Unlisted" status, which prevents the page from appearing in search results no matter what status it may have. A page becomes Unlisted either because of matching Search Panel rules or because it has been specifically set as "Unlisted" using the Entry List Panel. iii. New entries If you only ran the spider once before coming to the List page, you will notice that all of the entry URI's down the left-hand of the page are listed in boldface. After each spider, new pages, or existing pages with updated content are marked in bold. You can filter the list to view only pages updated since the last spider by checking the "New" checkbox next to the "Filters" title and hitting the "Set" button. If there are no "New" pages in your list, this checkbox will be disabled. iv. The Action dropdown menu Under the Filters toolbar, there is an Action dropdown. Using this dropdown in combination with the checkboxes along the left-hand side of the list, you can perform many different actions on single or multiple pages automatically. There are many options here useful for changing the attributes of many entries at once such as unlisting and relisting, changing category and even respidering. v. Data Locking By default the Orca Spider has a very rigid system of updating entry information. If the spider finds text which can be interpreted as a "title", it will overwrite the entry's current title; if it cannot find any such text, the existing title will be *retained* rather than deleted. The same system goes for Keywords and Description. In this fashion, you can easily include custom titles, keywords and descriptions for pages which do not natively contain this information. In certain cases you may want to assign custom title, description or keyword data to an entry which *does* contain this information already. By default, your custom information will be overwritten by the spider the next time it runs. You can prevent the spider from overwriting these three items by putting a "Data Lock" on the entry. You can do this either through the Action dropdown menu, or by clicking the Edit button for any entry and checking the Data Lock option. With this option enabled, the body text of this entry will still be updated, but the title, description and keywords will not. This prevents your custom changes from being overwritten. Entries which are Data Locked will display a copyright symbol next to their Title/URI. ************************************************************************ ************************************************************************ 6. Searching i. Output format Now that you have indexed pages at your site, visit the "_search.php" page with your web browser. You should find a search input form waiting to be used. If you added more categories than just "Main" using the Entry List page, the ability to filter results by category will appear via a handy dropdown box. A feature of the Orca Search is that the body.xhtml.php file, which displays the output, can be modified however you wish to fit your own website style. It takes all of the output from the "head.php" file, interprets it, and displays it in a logical form. If you want to make your own output file(s), examine "head.php" for a description of the PHP variables that file creates. It is possible to create practically any output format from the "head.php" output, even archiving it straight into another database! All that's needed is an appropriate body file to crunch the output. ii. Customisation The "_search.php" page is just a sample php page with the bare minimum amount of PHP code and HTML to display the search output. This is by design to make it easy for you to embed the output into your own existing website design. There are four important steps to take when embedding the search page. Open the "_search.php" in your text editor and we'll go through them now. a) First you should see that there are two included files right at the top of the page: "os2/config.php" and "os2/head.php". These files set up the environment and handle the search process, so they must be included in the results page *before* any HTML output. This means before the tag, and before any other whitespace. Otherwise you will get many "Headers could not be sent" errors. b) After this is a very important element which declares the charset of the search page. You definitely want search requests entered on this page to be in the *same* character encoding as the data you spidered or else results may not display properly. By using the $vData['c.charset'] variable, the charset is filled in with the Display Charset of the control panel by default. If you require the search results to display with a *different* character encoding than the Control Panel, change it in the element here. c) Finally there is the actual output include itself. In this case it loads "os2/body.xhtml.php" and will display all the output associated with each search request. Place this include where the content usually goes in your website layout. That's it! Any other PHP or HTML code is entirely up to you! iii. Standalone search boxes If you want to add a searchbox elsewhere on your site, just use one of these sample bits of HTML code: a) Search box with submit button:
... replace _search.php with the name of your search page. b) Without submit button (press enter to submit):
... again replacing _search.php with the name of your search page. To preselect a category for any search box, just add the following element to the form, replacing categoryname with the name of your desired category: _OR_ you can add the category drop down menu from the search output page. The category drop-down menu appears there only if you have more than one category, and is dynamically generated depending on what categories you have. If you'd like to include this drop down in a search box on another page, just copy the