Installation

Debian package

A Debian package is available with all software dependencies, ready to run. Only a few network parameters need to be customized after installation (hostname, shards, ...).
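
For a single host, the installation boils down to installing the package. The package file name below is only an assumption; adapt it to the release you downloaded:

sudo dpkg -i seqcrawler_<version>_all.deb   # install the package (file name is illustrative)
sudo apt-get -f install                     # pull in any missing dependencies, if needed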

For multiple shards, install the package on several servers, then update the configuration for sharding (see Solr_configuration).

An administrator guide with installation steps and desktop usage is available in the documentation section.

Administrator guide

Manual installation

Requirements

  • Linux/Unix system (validated on RH5 and Ubuntu 9.10)
  • Tomcat6
  • JDK 1.6+
  • Apache2
  • Riak or MongoDB (default) for backend storage

Optional

 
 To install GBrowse2 and BioPerl you can use the following script. It is only a helper; it is advised to refer to the appropriate documentation.

perl -MCPAN -e 'o conf commit'
perl -MCPAN -e 'install Module::Build'
perl -MCPAN -e 'install IO::String'
perl -MCPAN -e 'install CJFIELDS/BioPerl-1.6.1.tar.gz'
perl -MCPAN -e 'install Bio::SeqIO::entrezgene'
perl -MCPAN -e 'install GD::SVG'
perl -MCPAN -e 'install Bio::Graphics'
perl -MCPAN -e 'install JSON'
perl -MCPAN -e 'install LWP'
perl -MCPAN -e 'install Storable'
perl -MCPAN -e 'install Capture::Tiny'
perl -MCPAN -e 'install File::Temp'
perl -MCPAN -e 'install Digest::MD5'
perl -MCPAN -e 'install CGI::Session'
perl -MCPAN -e 'install Statistics::Descriptive'
perl -MCPAN -e 'install DBI'
perl -MCPAN -e 'install DBD::mysql'
perl -MCPAN -e 'install DBD::Pg'
perl -MCPAN -e 'install DB_File::Lock'
perl -MCPAN -e 'install File::NFSLock'
perl -MCPAN -e 'install Template'
perl -MCPAN -e 'install Archive::Zip'
perl -MCPAN -e 'install Date::Parse'
perl -MCPAN -e 'install Math::BigInt'
perl -MCPAN -e 'install Math::BigInt::FastCalc'
perl -MCPAN -e 'install Math::BigInt::GMP'
perl -MCPAN -e 'install Net::OpenID::Consumer'
perl -MCPAN -e 'install Bio::Graphics::Browser2'
 

Additional Perl packages (an installation sketch follows the list):

  • JSON >= 2.21
  • LWP::Simple >= 1.41
  • URI::Escape >= 3.28
  • MongoDB >= 0.40 (only if MongoDB is used as the storage backend)
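
These can be installed the same way as the modules above, for example:

perl -MCPAN -e 'install JSON'
perl -MCPAN -e 'install LWP::Simple'
perl -MCPAN -e 'install URI::Escape'
perl -MCPAN -e 'install MongoDB'   # only if MongoDB is used as the storage backend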

Installation steps

Available packages

For an install from scratch, several packages are available:

  • CrawlerIndex: package to index biological data, either locally with an embedded server or remotely against an index server with a configured schema.
  • CrawlerSearchWebApp: provides a search interface and an export function on top of the search engine.
  • GBrowse2 library: DBI interface between the GBrowse2 application and the search engine.

All packages are provided ready to install; no compilation is needed.

Installation instructions

The whole installation should be hidden behind an Apache frontend. This hides the multiple ports in use (Tomcat, Riak, ...), eases access for users behind firewalls, and allows easier securing of the services with filters.

The Apache configuration should restrict write operations on the storage host/virtual host to the internal network.

Example configuration:

<Limit PUT DELETE>
  order deny,allow
  deny from all
  # allow write operations from the internal network only
  allow from 192.168.1.0/24
</Limit>
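
To expose Tomcat (and, if used, the Riak HTTP interface) through the Apache frontend, a reverse proxy can be set up. A minimal sketch for a Debian-style Apache 2 layout, assuming the default Tomcat (8080) and Riak (8098) ports; file names and paths are illustrative only:

sudo a2enmod proxy proxy_http
sudo tee /etc/apache2/conf.d/seqcrawler-proxy.conf >/dev/null <<'EOF'
# forward the Solr webapp and the Riak HTTP interface through Apache
ProxyPass        /solr http://localhost:8080/solr
ProxyPassReverse /solr http://localhost:8080/solr
ProxyPass        /riak http://localhost:8098/riak
ProxyPassReverse /riak http://localhost:8098/riak
EOF
sudo /etc/init.d/apache2 reload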

Solr install:

Download and unzip apache-solr-3.3.0.zip into a directory (say, /opt).

In the newly created directory, create a new directory "seqcrawler" and copy the solr directory into it. You will end up with, for example:

/opt/apache-solr-3.3.0/seqcrawler/solr
                                      - /bin
                                      - /conf
                                      - /data
                                      - /dataset
                                      - /lib

In your Tomcat instance (or another servlet container), deploy apache-solr-3.3.0.war (found in the dist directory of apache-solr). For Tomcat, copy it to the webapps directory, as in the example below.
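
As a concrete example, under a Debian-style Tomcat 6 layout (the SeqCrawler solr directory location and the Tomcat paths are assumptions, adjust them to your environment):

cd /opt
unzip apache-solr-3.3.0.zip
mkdir -p /opt/apache-solr-3.3.0/seqcrawler
# copy the solr directory shipped with SeqCrawler into the new directory
cp -r /path/to/seqcrawler/solr /opt/apache-solr-3.3.0/seqcrawler/
# deploy the Solr war in Tomcat
cp /opt/apache-solr-3.3.0/dist/apache-solr-3.3.0.war /var/lib/tomcat6/webapps/solr.war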

In the CrawlerSearchWebApp web application, update the String AUTHDIR = "/"; declaration in lookup.jsp. AUTHDIR is a directory filter that restricts file access to a defined sub-directory structure.

In seqcrawler-conf.js, the following lines must be configured:

 
$pageCountResults = 20;  // Number of results per page
$solrUrl = "http://localhost/solr/select?rows="+$pageCountResults+"&qt=shard&wt=json&";  // Update host name
$solrFacet ="facet=on&facet.field=bank&";  // Update to add fields to facet (one facet.field per field)
$storageurl = "http://localhost"; // Update host
$gburl = "http://localhost"; // Update host
$gbbank = "seqcrawler"; // Update to GBrowse configuration name.
 

Copy readseq.jar into WEB-INF/lib to support Apache Tika indexing.

That's it, the Solr instance is deployed. Now we can configure it.


Solr configuration:

Two main files require configuration. For "expert" configuration, please refer to the Solr documentation.

  • conf/solrconfig.xml:

    The solrconfig file is the main configuration of the index server; the following needs to be updated:

    • <dataDir>${solr.data.dir:./solr/data}</dataDir>: dataDir points to the directory containing the index. It can be a relative path.
    • If index sharding is used (split index in smaller parts on different servers), the following must be updated:
         <!-- shard search handler, required for GBrowse integration -->
         <requestHandler name="shard" class="solr.SearchHandler">
          <!-- default values for query parameters -->
           <lst name="defaults">
            <str name="echoParams">explicit</str>
            <!-- UPDATE SHARDS TO SERVERS HOSTING THE INDEXES, INCLUDING CURRENT HOST -->
            <str name="shards">shardserver1/solr/,shardserver2/solr/,...</str>
           </lst>
        </requestHandler> 

      If the itas interface is used, do the same for the itas, mobile and mobiledetails handlers. In these same handlers, update gbHost and stHost to the GBrowse and storage backend host names (localhost may work, but Riak, for example, listens on a specific network interface).

      The shards parameter tells the system to query all of those index servers (the current host must be listed too) to collect the matches. If it is not set, only the current host is queried.

    • Other parameters require advanced knowledge of the Solr system. Please refer to the Solr documentation.
  • conf/schema.xml:

    This file contains the index structure. It defines both field types and fields. Many field types are predefined and should be sufficient (int types, string types, ...), but it is possible to develop your own: either create a new fieldType definition in the schema using the existing tokenizers and filters, or create new tokenizers or filters, add them to the classpath and reference them in the schema. Again, please refer to the Solr documentation for more information. The available field types are listed in the installed schema.xml.

    The SeqCrawler index tool lowercases all field names. As such, all field names should be declared in lowercase in the schema and queried in lowercase.

    Fields must be defined in order to be indexed. Example field definitions:

       <!-- 
         Field with name "feature", will be indexed (searchable) and stored (value can be returned in search interface).
         It is indexed with type "lowercase", i.e. all letters will be lowercased but words will not be split.
       -->
       <field name="feature" type="lowercase" indexed="true" stored="true"/>
    
       <!--
         Field with name "start" is of type "int", i.e. an integer that can be queried with range values. It is indexed and stored (searchable and value available in matches),
         and it is also multiValued (i.e. the same document can have multiple start fields).
       -->
       <field name="start" type="int" indexed="true" stored="true" multiValued="true"/>
    
       <!--
         Field with name "text", of type "text" (lowercased, split into words), is indexed but not stored (searchable, but its content is not available in the interface).
       -->
       <field name="text" type="text" indexed="true" stored="false" multiValued="true"/>

    Dynamic fields can also be defined: the field definition is then chosen by matching a pattern against the field name (only when the field is not explicitly defined in the schema):

      <!--
        This is a default that will match all fields not previously defined
        CAUTION: it is not multi-valued, fields with multiple values will be rejected here.
      -->
      <dynamicField name="*" type="text"    indexed="true"  stored="false"/>
      
      <!-- Example of a dynamic field matching all my* field names with a different definition -->
      <dynamicField name="my*" type="text"    indexed="true"  stored="true"/>
      

    Finally, here are a few important configuration elements (they should not require modification for seqcrawler use cases):

     <!--
       Field used as a unique identifier. In seqcrawler, this field is automatically generated for each document.
     -->
     <uniqueKey>seqid</uniqueKey>
    
     <!-- 
       Field for the QueryParser to use when an explicit field name is absent
     -->
     <defaultSearchField>text</defaultSearchField>
    
     <!--
     SolrQueryParser configuration: defaultOperator="AND|OR"
     This means that query "solr lucene" will search for "solr OR lucene"
     -->
     <solrQueryParser defaultOperator="OR"/>
    
      <!--
      copyField commands copy one field to another at the time a document
      is added to the index.  It's used either to index the same field differently,
      or to add multiple fields to the same field for easier/faster searching.
      -->
       
      <copyField source="*" dest="text"/>
    
  • TOMCAT_HOME/webapps/solr_app_name/WEB-INF/web.xml

    Update the solr/home parameter entry to point to the Solr home directory (the solr directory installed above).

    It is also advised (but not mandatory) to secure the admin interface of Solr (solrurl/admin). This can be done as follows:

    <security-constraint>
        <web-resource-collection>
          <web-resource-name>
            Protected Section
          </web-resource-name>
          <!-- This would protect the entire site -->
          <url-pattern> /admin/* </url-pattern>
        </web-resource-collection>
        <auth-constraint>
          <!-- Roles that have access -->
          <role-name>crawleradmin</role-name>
        </auth-constraint>
      </security-constraint>
    
      <!-- Define reference to the user database for looking up roles -->
      <resource-env-ref>
        <description>
          Link to the UserDatabase instance from which we request lists of
          defined role names.  Typically, this will be connected to the global
          user database with a ResourceLink element in server.xml or the context
          configuration file for the Manager web application.
        </description>
        <resource-env-ref-name>users</resource-env-ref-name>
        <resource-env-ref-type>
          org.apache.catalina.UserDatabase
        </resource-env-ref-type>
      </resource-env-ref>
    
      <!-- BASIC authentication -->
      <login-config>
        <auth-method> BASIC </auth-method>
        <realm-name> Seqcrawler Basic Authentication </realm-name>
      </login-config>
    
      <!-- Security roles referenced by this web application -->
      <security-role>
        <description>
          The role that is required to log in to the Admin section
        </description>
        <role-name>crawleradmin</role-name>
      </security-role>
    
    

    Then all you need to do is define a user with the crawleradmin role in your web container (a sketch for Tomcat follows below).

Once the configuration is done, the web container must be restarted (or the application reloaded).
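
For Tomcat 6, a minimal sketch (paths, user name and password are placeholders; the final query simply checks that the select handler answers once everything is in place):

# 1) declare the role and user inside the <tomcat-users> element of /etc/tomcat6/tomcat-users.xml:
#      <role rolename="crawleradmin"/>
#      <user username="admin" password="CHANGE_ME" roles="crawleradmin"/>
# 2) restart the container so the web.xml and tomcat-users.xml changes are picked up
sudo /etc/init.d/tomcat6 restart
# 3) quick sanity check of the index server (host, port and application name depend on your deployment)
curl "http://localhost:8080/solr/select?q=*:*&wt=json"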


SeqCrawler configuration:

The SeqCrawler shell scripts in the solr/bin directory must be updated: the INDEXHOME environment variable needs to be set to the solr/bin path. The scripts are not mandatory; they are only convenience wrappers, and the program can also be called directly.

A seqcrawler.properties file is also present in the solr/bin directory. Though not mandatory, it can be used to specify the solr.home and solr.data directories. The program requires at least the solr.home variable to be set in order to run. Those variables can be passed on the command line with the sh and sd parameters, but it can be easier to specify them once in the properties file.

Warning: seqcrawler.properties is looked up in the current working directory when running a script or executing the Java jar. If a script is executed from /tmp, for example, the properties file must be in /tmp too.
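
A minimal seqcrawler.properties sketch, reusing the directory layout from the Solr install section (adapt the paths to your own installation):

# write or adjust seqcrawler.properties in the directory you run the tools from
cat > seqcrawler.properties <<'EOF'
solr.home=/opt/apache-solr-3.3.0/seqcrawler/solr
solr.data=/opt/apache-solr-3.3.0/seqcrawler/solr/data
EOF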

  • The solr.home variable points to the solr directory (under which the conf directory is present). It is mandatory.
  • The solr.data variable points to the directory that will contain the index of the current host. If solr.data points to /index/data, Solr will create the sub-directories index and spellchecker there. If solr.data is not set, the default value from solrconfig.xml is used. This variable is often used during intermediate index shard creation to point to a specific data directory.
  • It is possible to define field recoders. With recoders, a field is transformed on the fly into one or several other fields; this can be used, for example, to split a group of data into single elements. The program does not provide any recoder for the moment, but it is easy to develop new ones: a Java class must implement the FieldRecoder interface, which provides a recode function. The returned list is indexed the same way the fields would have been indexed in the original document. The recoded field itself is not indexed; if it is required, it should also be returned in the pair list. If a key is already present in the document, the values are concatenated.

    To specify a recoder, add to seqcrawler.properties a property of the form bankname.fieldname.recode = class_of_recoder. Of course, the class must be in the classpath of the program.


Interface customization:

If installed, the basic itas web interface can be used to query the index. It is managed by the itas request handler in solrconfig.xml. The web interface can be customized easily: the templates are Velocity templates located in SOLR_HOME/conf/velocity, and the CSS and JavaScript files are under webapp_directory/itas.

A more complete interface with Ajax support is available via index.html. It uses jQuery and sends background requests to Solr. Fields and results are rendered by specific handlers depending on the content type (gff, embl, ...). The JavaScript can be modified to change how results are displayed; the file main.css holds the CSS of the web page.


Riak install:

Riak is a really nice solution; however, it is quite slow for very large imports. If a large amount of data needs to be imported first, I suggest using MongoDB (see the next chapter).

Once Riak is installed and running, a web interface to query it must be installed. A basic interface is available under the riak directory of seqcrawler. This interface can easily be customized (dataquery.html).

The data upload script needs the curl program. Other tools can be used to upload data, but the script should then be updated accordingly.

To load the web interface, just call upload.sh with the name of the file to upload.

The web interface is accessible at http://myhost:8098/riak/web/dataquery.html with the following parameters:

  • id : Identifier of the element (gene id for example)
  • source : Identifier of the source element (id of the chromosome for example, to extract raw dna sequence)
  • start : (optional) Start position to extract from data source
  • stop : (optional) Stop position to extract from data source

    The host name in upload.sh should be updated beforehand:

    curl --silent -X PUT -H "${content_type}" --data-binary @$file http://seqcrawler.genouest.org:8098/riak/web/$file;
    

    All of the following files must be uploaded (a helper loop is sketched after the list):

  • dataquery.html
  • jquery-1.4.2.min.js
  • jquery-ui.1.8.2.custom.css
  • jquery-ui.1.8.2.custom.min.js
  • jquery.url.packed.js
  • main.css
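
    For example, assuming upload.sh and the interface files are in the current directory:

    # upload every interface file to Riak via upload.sh
    for f in dataquery.html jquery-1.4.2.min.js jquery-ui.1.8.2.custom.css \
             jquery-ui.1.8.2.custom.min.js jquery.url.packed.js main.css; do
        sh upload.sh "$f"
    done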

MongoDB install:

Refer to the MongoDB web site for the installation procedure. Note that it is packaged for several distributions (Ubuntu, Fedora, ...). To use MongoDB as a backend, the MongoDB Perl package is also required. Once MongoDB is installed, configure it to point to a storage directory (in /etc/mongodb.conf) and grant the rights required to write to that directory.
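
A possible sequence on a Debian-style system (package and service names, the storage path and the mongodb user are assumptions; adapt them to your distribution):

sudo mkdir -p /data/seqcrawler-mongo
sudo chown mongodb:mongodb /data/seqcrawler-mongo      # user created by the MongoDB package
# set dbpath in /etc/mongodb.conf (or edit the existing dbpath line)
echo "dbpath=/data/seqcrawler-mongo" | sudo tee -a /etc/mongodb.conf
sudo /etc/init.d/mongodb restart
perl -MCPAN -e 'install MongoDB'   # Perl driver required to use MongoDB as backend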

Data can be imported with the mongoimport tool, using a command like:

mongoimport --dbpath MYDBPATH -d seqcrawler -c bank --file MYJSONFILE

To provide a web interface to the storage, the following files must be copied into the Apache directories (a copy sketch follows the list):

  • In /var/www/mongo (depends on system)
    • dataquery.html (the web interface is then accessible at http://myhostname/mongo/dataquery.html)
    • jquery-1.4.2.min.js
    • jquery-ui.1.8.2.custom.css
    • jquery-ui.1.8.2.custom.min.js
    • jquery.url.packed.js
    • main.css
  • In /usr/lib/cgi-bin/mongo (depends on system)
    • mongo.pl (data can then be queried with a URL like http://myhostname/cgi-bin/mongo/mongo.pl?id=BX005216)
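
A copy sketch matching the layout above (the target directories depend on your system, as noted):

# web interface files served by Apache
sudo mkdir -p /var/www/mongo
sudo cp dataquery.html jquery-1.4.2.min.js jquery-ui.1.8.2.custom.css \
        jquery-ui.1.8.2.custom.min.js jquery.url.packed.js main.css /var/www/mongo/
# CGI script answering the queries
sudo mkdir -p /usr/lib/cgi-bin/mongo
sudo cp mongo.pl /usr/lib/cgi-bin/mongo/
sudo chmod +x /usr/lib/cgi-bin/mongo/mongo.pl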