Appendix 1 - Features and functionality of BOLD
BARCODE OF LIFE DATA SYSTEMS
BOLDSYSTEMS.org Handbook for BOLD 2.5 November 2009
Table of Contents

General
1. Introduction .... 2
2. Taxonomy Browser .... 3
3. Signing up for BOLD .... 4
4. Projects on BOLD .... 4
5. Create a Project .... 5
6. Searching for data .... 6
7. Primer Database .... 7

Data Management
8. Data Submission .... 8
9. Image Submission .... 11
10. Trace & Primer Submission .... 13
11. Sequence Submission .... 15
12. Project Summary .... 16

Publishing
13. GenBank Submission .... 17
14. Bibliographies .... 18

Analysis Tools
15. Image Library .... 19
16. Distribution Maps .... 20
17. Taxon ID Tree .... 21
18. Identification Engine .... 22
19. Barcode Index Numbers (BINs) .... 23
20. Distance Summary .... 24
21. Sequence Composition .... 24
22. Nearest Neighbour Summary .... 25
23. DNA Degradation Test .... 25
24. Accumulation Curve .... 26
25. Alignment Viewer .... 26

External Connectivity
26. Web Services .... 27
This handbook provides details on BOLD functionality, data structures and best practices. It explains how to use this system to collect, manage and publish Barcode and ancillary data. It also provides details on the integrated analytical tools and web services. For an online version of our documentation, please visit: www.boldsystems.org/docs/ For assistance with any feature of BOLD, please email the BOLD Support Team: support@boldsystems.org
1. Introduction
The services of BOLD are both license free and platform independent. Anyone with a computer and an internet connection can use both the storage capacity and the analytical tools that BOLD provides. Because BOLD employs cloud computing, users are not restricted by their operating system or the speed of their equipment.
The Barcode of Life Data System (BOLD) is an informatics workbench aiding the acquisition, storage, analysis and publication of DNA barcode records. By assembling molecular, morphological and distributional data, it bridges the traditional bioinformatics chasm. BOLD is freely available to any researcher with interests in DNA Barcoding. By providing specialized services, it aids the assembly of records that meet the standards needed to gain BARCODE designation in the global sequence databases. Because of its web-based delivery and flexible data security model, it is also well positioned to support projects that involve broad research alliances.
Currently, BOLD has over 700,000 Barcode records compiled by more than 4000 researchers from over 1000 different institutions. Several new features have been deployed in the last year as part of BOLD Version 2.5, including multi-locus support for Barcodes and identifications, new analytical tools, additional ancillary data, and enhanced performance for faster analytics and browsing. BOLD aims to keep up to date with the needs of both high-throughput labs and smaller labs, as well as individual researchers. For more information, please email info@boldsystems.org. Thank you for your interest in the Barcode of Life Data Systems! Sincerely, the BOLD Team
Figure 1-1: Home Page of BOLD Systems as of November 2009: www.boldsystems.org.
2. Taxonomy Browser
The taxonomy browser is a public feature that allows users to examine the progress of DNA barcoding, at different levels in the taxonomic hierarchy. To access the taxonomy browser, click on the link in the menu bar on the BOLD main page. If you have a specific taxon of interest, enter the taxonomic name into the search bar in the upper right corner of the BOLD main page, or within the taxonomy browser. Progress is being made in Animals, Plants, Fungi, and Protists and users can browse through each kingdom from phylum down to the species level. The taxonomy browser is updated on a regular basis from all the records on BOLD. Figure 2-1 depicts the taxonomy browser for a selected taxon. See Table 2-1 for descriptions of each section featured on the taxonomy browser taxon page.
Figure 2-1: The BOLD taxonomy browser depicting the Family page for Tachinidae.
1 Lineage
Lists the higher taxonomic levels.
2 Specimen Records
The number of specimen records.
3 Specimens with Barcodes
The number of barcoded specimens of this taxon on BOLD.
4 Public Sequences
The number of public sequences and a link to download them.
5 List of Species Barcoded
A list of all species with records on BOLD. The number of specimens, the number of sequences and the number of sequences greater than 500bp are listed.
6 Link Outs
Links to several community partners’ pages for that specimen.
7 Lower Taxonomy
Links to all lower classifications along the left with number of specimen records.
8 Graphic Displays of:
• the total number of barcodes and reference barcodes.
• the quantity of species barcodes and those used as reference barcodes.
• the institutions where the specimens are deposited.
• a map of the world highlighting specimen collection locations.
• a graph showing the frequency of specimens/barcodes against age.
• a list of countries where specimens were collected, including the number of specimens from each country.
• various images of specimens within that taxonomic group.
Table 2-1: Information available at each taxonomic level within the BOLD taxonomy browser.
3. Signing up for BOLD
On the BOLD main page, click on ‘Request an Account’ on the menu bar to access the new user registration form. After you have submitted your registration, a welcome e-mail will be sent to you with the information you need to log in and begin using your BOLD account. With your account you can access your private workspace which brings you to the Project List page, illustrated below.
Valid Email Address
Use a current institutional email.
First Name
Fill in your first name; first letter should be capitalized.
Middle Initial
Fill in middle initial(s) if needed, capitalized, followed by period(s).
Last Name
Fill in your last name; first letter should be capitalized.
Institutional Affiliation
Select the name of your institution or add a new institution if needed. (Private collections should be named in the format: “Research Collection of John Smith”.)
Password
Should be at least 5 characters.
Getting an account on BOLD allows you to upload your data into a private workspace and take advantage of the integrated analytical tools.
Table 3-1: Information required to create a new user account on BOLD.
Figure 3-1: New user account application on BOLD.
If you forget your password or username for BOLD, there is a link on the main page to request a reminder email be sent to your email address. For security reasons, your session will eventually time-out. You will be prompted to sign back in when trying to navigate through the timed-out session. Signing back in at this point will bring you to the page you were attempting to reach when the time-out occurred.
4. Projects on BOLD
The Project List page summarizes all projects you have access to, as well as all public projects on BOLD. From this page, there are a few features you can access:
• Create New Project: Allows you to create a new project to store data. (See page 5 for details.)
• Merge Projects: To merge projects, select the projects you wish to merge, then click on “Merge Projects”. This will create a virtual project for viewing, downloading and analyzing.
• Search All Records: This feature allows you to perform a multi-term search among all of your projects and the public projects. (See page 6 for details.)
• View All Primers: Directs you to the primer database. (See page 7 for details.) Within this page, you can also register a new primer. (See page 15 for details.)
• Bibliography Submissions: Add a bibliographic entry to the publication database and associate your records. (See page 18 for details.)
• Campaigns: You can navigate to any campaign that is associated with your projects. This is a useful organizational tool when you are involved in many projects.
• Public Projects: All public projects on BOLD are accessible in your project list. You can open and analyze these records individually by project or by virtually merging them with your own projects. You can access the public projects without logging in by clicking on “Published Projects” on the BOLD main page. Users will not be able to make changes to public projects.
Figure 4-1: Project List Page
5. Create a Project
Once logged into BOLD, select the ‘Create New Project’ link on the top left side of the project list page. This will take you to the New Project Submission Form. The following pieces of information need to be entered in order to create a project:
Project Title
Please create a descriptive name.
Project Code
A 3-5 letter code that needs to be unique across BOLD. A good approach is to use your initials and 2 or 3 other letters as an acronym for the title.
Project Type
Choose between the following options:
• Data Project (contains specimen & sequence records)
• Folder Project (contains other projects)
Primary Marker
Select your primary marker. CO1 is the default. Primary marker options are:
• Cytochrome Oxidase Subunit 1 5’ Region
• Internal Transcribed Spacer (ITS) Region
• rbcL
• Maturase K (matK)
Secondary Marker
Select as many secondary markers as needed from the list of registered markers.
Campaign
Select the name of the campaign the project is part of or ‘None (General Project)’ if it is not part of a campaign.
Place in Container
Select the name of the Folder Project or ‘Independent Project’ if it does not belong in a folder project.
Project Description
Enter a summary of the use and intention of the project. 15 - 500 characters required.
Project Access
Check to make project publicly visible.
Project Manager
The person who creates a project is automatically the Project Manager and has full specimen and sequence access.
Assign Users
Other BOLD users can be added to a project. Different levels of access are possible.
Sequence Access:
• Analyze Only - user can perform analysis on the data, but cannot view more than a summary of the data (sequence and related information remain hidden).
• View & Download - user can view or download the sequence data, as well as analyze.
• Edit Sequences - user can upload trace files, upload, edit and delete sequences, as well as view and analyze.
Specimen Access:
• Edit Specimens - user has control over sample identifiers, taxonomy, collection data, and images of the specimens: this edit permission level is intended for project managers, collectors, and taxonomists.
Figure 5-1: The BOLD new project creation form.
Please note that the person who creates a project is automatically assigned as the Project Manager. The Project Manager can change project details and add/remove users at any time by simply clicking on ‘Modify Project Properties’ in the Project Options menu of the Project Console. (See page 16 for details). Secondary markers are added upon request. If a marker you require is not on the list, please contact BOLD support staff to register one through support@boldsystems.org.
Table 5-1: Field definitions for BOLD project creation form.
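The two length rules in Table 5-1 can be checked before the form is submitted. The following is a minimal sketch, not part of BOLD: the function name `check_project_fields` is hypothetical, and the uniqueness of a Project Code across BOLD cannot be verified locally, so it is not checked here.

```python
import re


def check_project_fields(code: str, description: str) -> list[str]:
    """Return a list of problems found in two project-form fields.

    Hypothetical helper: it only applies the length rules stated in
    Table 5-1 and does not contact BOLD.
    """
    problems = []
    # Table 5-1: the Project Code is a 3-5 letter code.
    if not re.fullmatch(r"[A-Za-z]{3,5}", code):
        problems.append("Project Code must be 3-5 letters")
    # Table 5-1: the Project Description requires 15-500 characters.
    if not 15 <= len(description) <= 500:
        problems.append("Project Description must be 15-500 characters")
    return problems
```

For example, `check_project_fields("JSRF", "Robber flies of southern Ontario, 2009 survey")` returns an empty list, while a two-letter code with a five-character description yields two messages.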
6. Searching for Data
There are two types of searches for BOLD: Basic Search and Advanced Search.
Basic Search provides drop-down menus for taxonomy or geography on BOLD, giving a wide scope of results. Advanced Search allows for more specificity: multiple criteria can be combined to narrow the scope of each search. In the Advanced Search section, you can also provide exclusion criteria for taxonomy and geography to keep unwanted information out of the results.
There is one common interface for the three ways to search BOLD. To search from (1) all projects, select the ‘Search All Records’ link in the list of BOLD projects. To search from within a (2) merged set of projects or a (3) single project, select ‘Search/Filter’ from the Project Console.
Taxonomy
Searches the taxonomic names from Phylum to Species on BOLD. Each drop down selection will refine the available choices in the next lowest box.
Geography
Country and State/Province: Searches the country and province names on BOLD.
Table 6-1: Explanation of the terms used within the Basic BOLD search functions.
Taxonomy
Searches specific taxonomic names on BOLD. There is a text field for search terms that should be either included or excluded from the search.
Geography Country/Province
Searches the country and province names on BOLD. There is a text field for search terms that should be either included or excluded from the search.
Geography Region
Searches region names on BOLD. There is a text field for search terms that should be included in the search.
Sequence Length
Text fields to define minimum and maximum number of base pairs desired.
Sample ID/Process ID/GenBank Acc.
Searches Sample IDs, Process IDs and GenBank accessions on BOLD. By clicking the link to the right, you can paste a list of these IDs from a spreadsheet.
Include Published Data and Data Mined from GenBank
When checked, the search also includes published data and sequences mined from GenBank which fit the Barcode profile.
Select a Single Representative per Species
When checked, the search will only retrieve one representative per species found by random selection from the entries with longest sequence length.
Table 6-2: Explanation of the terms used within the Advanced BOLD search functions.
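The ‘Select a Single Representative per Species’ option can be pictured with a short sketch. This is one possible reading of “random selection from the entries with longest sequence length”; the handbook does not describe how BOLD implements it. The function name and the record keys `species` and `sequence` are hypothetical.

```python
import random
from collections import defaultdict


def one_representative_per_species(records, rng=None):
    """Pick one record per species from those with the longest sequence.

    `records` is a list of dicts with hypothetical keys "species" and
    "sequence". Ties on length are broken by random choice.
    """
    rng = rng or random.Random()
    by_species = defaultdict(list)
    for rec in records:
        by_species[rec["species"]].append(rec)

    chosen = {}
    for species, recs in by_species.items():
        longest = max(len(r["sequence"]) for r in recs)
        candidates = [r for r in recs if len(r["sequence"]) == longest]
        chosen[species] = rng.choice(candidates)
    return chosen
```

Passing a seeded `random.Random` as `rng` makes the selection repeatable, which is convenient when the same subset is needed for several analyses.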
Figure 6-1: The BOLD search engine, showing both basic and advanced search functions.
7. Primer Database
PCR and sequencing primers are required for the submission of trace electropherograms and for receiving Barcode designation on GenBank. The primer database is accessible from the BOLD project list and from within any project (Figure 7-1). Listed are all public primers, as well as any private primers registered by the user. New primers must be registered with BOLD before trace files are submitted. To register a new primer, select “Register Primers” from the Project Options menu in your project or through the Primer Database page. (See page 14 for more details.) Once you have registered a primer, you may edit it at any time.
Figure 7-1: Primer Database
Selecting a primer from the database will open the Primer Page (Figure 7-2), which provides details on the primer, including primer performance statistics derived from BOLD. Primer Pages can also be opened from the Sequence Data Page for any record. (See page 17 for more details.) To locate performance statistics for any primer pairing in a desired taxon, please visit the taxon page available via the Taxonomy Browser. Links to relevant publications to view PCR protocols for these primers can be located when available.
Figure 7-2: BOLD Primer Page
8. Data Submission Protocol
Data submission is how records are created on BOLD. Each record is assigned a BOLD Process ID when uploaded. Images, traces, and sequences can then be uploaded with the Process IDs. There are two ways to enter records onto BOLD: manually or with bulk uploads through the BOLD Data Managers.
This protocol assists in the submission of bulk data to BOLD through the BOLD Data Managers. This is the easiest way to populate your project with records, as well as the only way to enter new species taxonomy into the BOLD library. Described below is the format required for a correct submission.
Whenever a bulk submission is sent to the BOLD Data Management Team through submissions@boldsystems.org, the following pieces of information need to be sent in the body of the email, with the standard submission spreadsheet (See page 9 for more details) attached:
I. Project Title
II. Project Code
III. Project Manager
IV. Priority Level (High, Intermediate or Low) - This determines where your submission will be placed in the queue.
V. Submission Type (New Records or Update). If type is ‘Update’: Please specify which worksheets (Voucher Info, Taxonomy, Specimen Details, and/or Collection Data) need to be updated. See page 10 for more information on updates.
VI. Campaign, iBOL Working Group, or a known checklist, if applicable.
The data spreadsheet consists of 4 worksheets: a main specimen identifier worksheet (voucher info) that is linked to three other worksheets: taxonomy, specimen details, and collection data. (Refer to Tables 8-1 through 8-4 for field definitions.) See the next page for more instructions.

Sample ID *
ID associated with the sample being sequenced (often an extension of field or Museum ID).
Field ID *
Field number from a collection event or Specimen identifier from a private collection.
Museum ID *
Catalog number in curated collection for a vouchered specimen.
Collection Code
Code associated with given collection. Only fill in if Museum ID is given.
Institution Storing *
Full name of the institution where specimen is vouchered.
Sample Donor
Full name of individual responsible for providing specimen or tissue sample.
Donor E-mail
E-mail of the sample donor.
Table 8-1: Field definitions for Voucher (Specimen) info page on accompanying spreadsheet.

Full Taxonomy
Full taxonomy consisting of phylum*, class, order, family, subfamily (optional), genus, species in binomial format.
Identifier
Full name of primary individual responsible for providing taxonomic identification of the specimen.
Identifier E-mail
E-mail address of the primary identifier.
Identifier Institution
Institution of the identifier.
Table 8-2: Field definitions for Taxonomy page on accompanying spreadsheet.

Sex
Male/female/hermaphrodite only.
Reproduction
Sexual/asexual/cyclic parthenogen only.
Life Stage
Adult/immature only.
Extra Info
User Specified Characteristics (free text). Can be displayed on a tree or used to sort records. Limited to a maximum of 50 characters. Designate FAO region here.
Notes
Free text or XML tagged text. All XML text should be surrounded by the XML start (<xml>) and stop (</xml>) tags.
Table 8-3: Field definitions for Specimen Details page on accompanying spreadsheet.

Collectors
Comma delimited list of collectors.
Collection Date
Date of collection, must be in DD-MMM-YYYY format (e.g. 01-Jan-2005).
Continent/Ocean
ISO Continents and Oceans.
Country *
ISO Countries.
State/Province
States and provinces (according to Getty Geographical Thesaurus).
Region
Park, county, district, lake or river.
Sector
Sector of park or county/city.
Exact Site
Description of collection location.
GPS Coordinates
Latitude & Longitude in “degrees.decimal degrees” format (e.g. 45.837).
Elevation/Depth
Elevation or depth in metres only.
Table 8-4: Field definitions for Collection Data page on accompanying spreadsheet.

* Minimum required fields for new records.
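Several of the field rules in Tables 8-3 and 8-4 are mechanical enough to check before a spreadsheet is sent. The sketch below is an illustration only, not a BOLD tool: `check_record` and the dictionary keys are hypothetical names mirroring the column headings, and the date check relies on English month abbreviations (the default C locale in Python).

```python
from datetime import datetime

# Accepted values from Table 8-3 (compared case-insensitively here,
# which is an assumption; the handbook does not state case rules).
ALLOWED = {
    "Sex": {"male", "female", "hermaphrodite"},
    "Reproduction": {"sexual", "asexual", "cyclic parthenogen"},
    "Life Stage": {"adult", "immature"},
}
MAX_EXTRA_INFO = 50  # Table 8-3: Extra Info is limited to 50 characters


def check_record(rec: dict) -> list[str]:
    """Return a list of rule violations for one specimen record."""
    problems = []
    for field, allowed in ALLOWED.items():
        value = rec.get(field)
        if value and value.lower() not in allowed:
            problems.append(f"{field}: {value!r} is not an accepted value")

    date = rec.get("Collection Date")
    if date:
        try:
            # Table 8-4: DD-MMM-YYYY, e.g. 01-Jan-2005
            datetime.strptime(date, "%d-%b-%Y")
        except ValueError:
            problems.append(f"Collection Date: {date!r} is not DD-MMM-YYYY")

    if len(rec.get("Extra Info", "")) > MAX_EXTRA_INFO:
        problems.append("Extra Info: longer than 50 characters")
    return problems
```

Values that fail the Sex, Reproduction or Life Stage check are candidates for the Extra Info or Notes fields rather than for correction.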
All of the data in BOLD are organized by projects. There is a limit of 999 entries for a given project, to keep the size manageable, though related projects can be grouped into containers or temporarily merged with related projects for analysis. An individual entry in the database represents a barcode of a given specimen. The Process ID (assigned by BOLD upon specimen data record upload) uniquely represents a sample in BOLD. This is the identifier that is used to track a sample through the barcoding process: collection, taxonomic identification, sequencing, analysis and final publication of data.
Specimen data can be entered in one of two ways. As outlined here, for larger sets of samples, the data can be entered on the Data Submission Template spreadsheet and sent to BOLD. Data managers will review the data to ensure that it meets the minimum requirements, and input it into BOLD. For smaller numbers of entries (i.e. 1-10 records), users can enter sample data directly through the website by clicking on “Specimen Data” under the Uploads menu and using the manual interface there.
Here is an example of a properly filled in data submission. You can get this blank template online from http://www.boldsystems.org/docs/SpecimenData.xls. Use the tabs at the bottom of the Excel workbook to navigate through the four pages.
Specimen Info
Sample ID | Field ID | Museum voucher ID | Collection Code | Institution Storing | Sample Donor | Donor Email
Sample-demo01 | Sample-demo01 | | | Burke Museum | Joe Smith | jsmith@BIO.org
Sample-demo02 | Sample-demo02 | 15466-JUC-ISC | ISC | Burke Museum | Joe Smith | jsmith@BIO.org
Sample-demo03 | Sample-demo03 | | | Burke Museum | Joe Smith | jsmith@BIO.org
Figure 8-1: Example data for Specimen Info

Taxonomy
Sample ID | Phylum | Class | Order | Family | Subfamily | Genus | Species | Identifier | Identifier Email | Identifier Institution
Sample-demo01 | Arthropoda | Insecta | Diptera | Asilidae | Hydropsychinae | Efferia | Efferia aestuans | Joe Smith | jsmith@BIO.org | Oxford
Sample-demo02 | Arthropoda | Insecta | Diptera | Asilidae | | Asilus | | Joe Smith | jsmith@BIO.org | Oxford
Sample-demo03 | Arthropoda | Insecta | Diptera | | | | | Joe Smith | jsmith@BIO.org | Oxford
Figure 8-2: Example data for Taxonomy

Specimen Details
Sample ID | Sex | Reproduction | Life Stage | Extra Info | Notes
Sample-demo01 | Female | Sexual | Adult | Commonly called ‘Robber Fly’ | feeding on fruit
Sample-demo02 | Male | Sexual | Adult | |
Sample-demo03 | Male | Sexual | Adult | |
Figure 8-3: Example data for Specimen Details

Collection Info
Sample ID | Collectors | Collection Date | Continent / Ocean | Country | State / Province | Region | Sector | Exact Site | Latitude | Longitude | Elevation
Sample-demo01 | Joe Smith | 2-Feb-2009 | North America | Canada | Ontario | Wellington | Guelph | Riverside Park | 43.563 | -80.270 | 325
Sample-demo02 | Joe Smith | 27-Jul-2007 | Asia | Japan | Hokkaido | Soya | Omu | | 44.671 | 142.788 | 95
Sample-demo03 | Joe Smith | 5-Apr-2007 | Central America | Costa Rica | Guanacaste | ACG | Mundo Neuvo | Ricon de la Vieija | 10.772 | -85.434 | 305
Figure 8-4: Example data for Collection Info
There are two types of submissions: “New Submission” and “Update”.
A new submission occurs every time new records are added to a project. An update modifies records that already exist in a project. If you wish to update only one or two records, please manually select the specimen from the species record listing in your project and click on the “edit” button in the upper right corner. Any details can be edited in this way, except for adding new taxonomy to BOLD. If there is new taxonomy to add to the BOLD library, this should be sent by email as a taxonomy update to the BOLD Data Management Team through submissions@boldsystems.org.

New Submission
New submissions are project specific so that their data can be associated with a project on BOLD. If records are submitted that need to be entered into different projects on BOLD, a separate file for each project needs to be sent. Provide as much detail and additional information as possible with a new submission so that it will take less time later to update the blanks.
The minimal requirements for a new submission on BOLD are:
• Voucher Info Page - Sample ID
• Voucher Info Page - Field ID and/or Museum voucher ID
• Voucher Info Page - Institution Storing
• Taxonomy Page - Phylum
• Collection Page - Country
Other useful information:
Sample IDs:
• It is important to use a unique and original format for the Sample IDs. If the Sample IDs provided are not original to BOLD, they will need to be changed before the data can go online.
• Only the following characters may be used in the Sample ID, Field ID, and Museum ID: Numbers, letters, and ^ . : - _ ( ) # All other characters will be removed.
Specimen Details:
• If the sex, reproduction or life stage values for your specimen do not fit the accepted values for Specimen Details, please move the information to the Extra Info or Notes fields.
• Remember that the “Extra Info” field can be displayed on a Taxon ID Tree on BOLD and thus it is best to enter data there that may help when analyzing the data on a tree.
Identifiers/Donors:
• In the case where the donor or identifier is deceased or retired, please make note of that in the email field. It is important to provide this information so we can keep the database up-to-date.

Update Submission
The quickest way to update records in bulk is to download the Data Spreadsheet from BOLD containing the records that need to be modified. To do so, click on “Data Spreadsheets” from the Downloads menu on the upper left side of your project. Only download the worksheets and records that will be affected by the update (e.g. if the taxonomy needs to be updated, only download the Taxonomy worksheet; if specimen details and collection data need to be updated, only download the Specimen Details and Collection Data worksheets, etc.). Please do not download and submit updates on the core lab book.
Once the worksheets are downloaded, modify the data and copy it into the standard submission spreadsheet. The submitted update must reflect what the data should be on BOLD. Please send this to the Data Management Team through submissions@boldsystems.org.
NOTE: Any fields left empty will be considered blank and will be removed from BOLD during an update. Do not remove any data from the update sheet if you’d like it to stay on BOLD. The system cannot distinguish between “blank: do not update this field” and “blank: delete the content of this field”.
Updates to Voucher Info are slightly different from updates to Taxonomy, Specimen Details, and Collection Data.
a.) Updates to Voucher Info
Identical to new submissions, updates to the voucher info are project specific. The records need to be split into their corresponding project.
b.) Updates to Taxonomy, Specimen Details, and Collection Data
Updates to these sections are project independent. Records from any number of projects can be updated in one submission spreadsheet, and the number of records is (in theory) unlimited for this type of update. To select records from more than one project for a Taxonomy, Specimen Details or Collection Data update, you can use the search function (page 6), or merge projects in your project list.
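The identifier character rule and the minimal-field rule for new submissions can be previewed locally. The sketch below is hypothetical and not a BOLD utility: `clean_id` and `missing_minimum_fields` are illustrative names, the dictionary keys mirror the spreadsheet headings, and a literal reading of the rule is assumed, so spaces are treated as characters that would be removed.

```python
import re

# Anything other than letters, numbers and ^ . : - _ ( ) # is dropped.
# Assumption: spaces are not in the allowed set, so they are dropped too.
_DISALLOWED = re.compile(r"[^A-Za-z0-9^.:\-_()#]")


def clean_id(value: str) -> str:
    """Return the identifier with all disallowed characters removed."""
    return _DISALLOWED.sub("", value)


def missing_minimum_fields(rec: dict) -> list[str]:
    """List the minimal new-submission fields that are empty or absent."""
    missing = [
        field
        for field in ("Sample ID", "Institution Storing", "Phylum", "Country")
        if not rec.get(field)
    ]
    # Either a Field ID or a Museum voucher ID is sufficient.
    if not (rec.get("Field ID") or rec.get("Museum ID")):
        missing.append("Field ID and/or Museum ID")
    return missing
```

Comparing `clean_id(x)` with `x` for every Sample ID shows in advance which identifiers would be altered, which matters because the same Sample IDs are later used to match images and other uploads to records.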
9. Image Submission Protocol
Images should be uploaded to BOLD to complete a specimen record. An image provides support for identifications and makes comparisons easier between species.
Image File *
Complete (incl. extension) and identical file name (case sensitive) of images.
This protocol outlines the image submission process for BOLD. It describes the necessary format of the images and the ancillary data and the steps required to build the uploadable package required for a successful submission.
Original Specimen *
Enter “Yes” if the image shows the actual specimen for this record. Otherwise enter “No”.
View Metadata *
A short tag describing the orientation of the specimen that appears. Some standard views are: Dorsal, Lateral,Ventral, Frontal, etc.
Caption
Free text additional information about the image. Short descriptions are recommended, such as: part of organism photographed, life stage, sex, etc.
Measurement
Measurement that was taken (including the unit of measurement.).
Measurement Type
Item that was measured (e.g. body length, wing span, etc.).
Sample ID *
Sample ID for record, which must match Sample ID in BOLD.
Process ID *
Process ID for record, which must match Process ID in BOLD.
Copyright
Short free text field for photographer’s name or institution and date if necessary.
Copyright Details
Pick one of the following types: • copyright • creativecommons – share-alike • creativecommons – attribution • creativecommons – noncommercial
1. Collect Images: Collect high-quality images of specimens in .jpg format for your project. BOLD accepts high resolution images, (up to 20 megapixels) but only displays a greatly reduced thumbnail. The high resolution image is archived, but will not be used without the submitter’s consent. Refer to page 12 for a guide on picture orientation and quality. 2. Assemble Package: The image submission package should consist of all .jpg format images and a spreadsheet with the file names and ancillary data. Make sure that all images in the package are accounted for in the spreadsheet. When submitting more than one image per specimen simply copy the ‘Sample ID’ and ‘Process ID’ to the next line with the file name of the consecutive image. You can upload up to 10 images per specimen, depending on organism characteristics. Please photograph several different orientations if needed.
Table 9-1: Field definitions for accompanying spreadsheet.
The submission spreadsheet should be named ImageData.xls and * Required Fields contain the columns described in Table 9-1. Steps: A. Fill in the ImageData.xls spreadsheet with all the data related to the images in the submission package. To easily create the list of image files in a folder, open a terminal window (Start > Run > cmd in Windows), navigate to the folder containing the image files, and run one of the following commands: • • •
Windows MacOS Linux/Unix
dir /b *.jpg>list.txt ls *.jpg*.JPG>list.txt ls *.jpg*.JPG>list.txt
These commands will generate a list of the image files in the current folder and save it in a document called ‘list.txt’ that will appear in the current folder. You can then open list.txt and move the data into the Image File column. Obtain the Process IDs by clicking on “Data Spreadsheets” under the Downloads menu on the left side of a project console. Download the Core Lab Book to get the Process IDs that are assigned to each Sample ID. See steps B and C below for further instructions.

Image File | Original Specimen | View Metadata | Caption | Measurement | Measurement Type | Sample Id | Process Id | Copyright | Copyright Details
ROM912-D.jpg | yes | Dorsal | skull | 15 mm | skull length | ROM 10912 | BM272-03 | J. Beck 2008 | copyright
ROM912-L.jpg | yes | Lateral | lower jaw | 7 mm | length | ROM 10912 | BM272-03 | J. Beck 2008 | copyright
ROM913-L2.jpg | yes | Lateral | skull | 15 mm | skull length | ROM 10913 | BM273-03 | J. Beck 2008 | copyright
ROM913-V.jpg | yes | Ventral | skull | 15 mm | skull length | ROM 10913 | BM273-03 | J. Beck 2008 | copyright
ROM914-D2.jpg | yes | Dorsal | skin | 50 mm | body length | ROM 10914 | BM274-03 | J. Beck 2008 | copyright
Figure 9-1: Image Submission Spreadsheet (ImageData.xls) completed with sample data.
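The file listing in step A can also be scripted so that the same approach works on every operating system. The following is a minimal sketch, not part of BOLD: the function names list_image_files and write_list are hypothetical, and the case-insensitive match on the .jpg extension is an assumption made so that files named .JPG are not missed.

```python
from pathlib import Path


def list_image_files(folder, extensions=(".jpg",)):
    """Return the sorted names of files in `folder` whose extension matches, ignoring case."""
    wanted = {ext.lower() for ext in extensions}
    return sorted(
        p.name
        for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() in wanted
    )


def write_list(folder, out_name="list.txt"):
    """Write one image file name per line to `out_name` inside `folder` and return the names."""
    names = list_image_files(folder)
    Path(folder, out_name).write_text("\n".join(names) + "\n", encoding="utf-8")
    return names
```

The names written to list.txt keep their original capitalization, so they can be pasted into the Image File column unchanged.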
B. These two components (image files and spreadsheet) need to be placed in a single folder. Compress them all into a single file before submitting. The following free tools are available to provide this functionality:
»» WinZip - http://www.winzip.com
»» WinRar - http://www.rarsoft.com
»» MacZipIt - http://www.maczipit.com

C. BOLD will accept a maximum zipped file size of 190 MB. Upload the images to BOLD by clicking on the “Specimen Images” link in the Uploads menu of the desired project. Select the zipped folder of images and then hit “submit”. Do not navigate away from the project or close the pop-up window until the successful upload message is displayed.

You can upload more images in separate batches to any record at any time. If you wish to delete images for a record, please contact the BOLD Support Team through support@boldsystems.org.

Tips and Troubleshooting for Image Uploads
• Zipped files must be under 190 MB in size. If the upload fails to initialize, the zipped file may be too large. Break it into more than one upload, each with its own spreadsheet.
• The spreadsheet cannot contain any formulas.
• If the upload program cannot find the image files, it is possibly because it cannot read the names. Make sure that the spreadsheet contains text values only.
• Full filenames must be used in the spreadsheet. The extension (.jpg) must be included in the image file name. The file extension is case sensitive.
• The spreadsheet must be named ImageData.xls. If the upload program cannot find the spreadsheet, confirm that it is named correctly (case sensitive).
• Free text fields of the spreadsheet are limited to 30 characters. Verify the data length in these fields and make adjustments if necessary.
• Data must start on the second line of the spreadsheet. There is only one line for the column headers.
• Adding extra columns to the sheet will cause errors.
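Several of the tips above can be checked before uploading. The sketch below is illustrative only and is not a BOLD tool: check_image_rows is a hypothetical helper, the rows are assumed to have been read from ImageData.xls into dictionaries keyed by column header, and treating only the Caption column as a free text field is an assumption.

```python
MAX_FREE_TEXT = 30  # character limit for free text fields, taken from the tips above


def check_image_rows(rows, files_in_folder, free_text_fields=("Caption",)):
    """Return a list of human-readable problems found in the spreadsheet rows.

    rows: list of dicts keyed by column header (for example "Image File").
    files_in_folder: names of the image files that will go into the package.
    free_text_fields: columns treated as free text (an assumption, adjust as needed).
    """
    problems = []
    available = set(files_in_folder)
    listed = set()
    # Data starts on line 2 of the spreadsheet; line 1 holds the column headers.
    for line_no, row in enumerate(rows, start=2):
        name = row.get("Image File", "")
        listed.add(name)
        if not name.lower().endswith(".jpg"):
            problems.append(f"line {line_no}: '{name}' lacks the .jpg extension")
        elif name not in available:
            # Exact comparison, because file names are case sensitive.
            problems.append(f"line {line_no}: '{name}' not found in the folder")
        for field in free_text_fields:
            if len(row.get(field, "")) > MAX_FREE_TEXT:
                problems.append(f"line {line_no}: {field} exceeds {MAX_FREE_TEXT} characters")
    for name in sorted(available - listed):
        problems.append(f"'{name}' is not listed in the spreadsheet")
    return problems
```

An empty list means that none of these particular checks failed; it does not guarantee that the upload will be accepted.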
Photography Guide
• Please take pictures using the high quality mode on your camera.
• The specimen should be centered in the image frame.
• Photos should be taken as close-up to the specimen as possible, leaving very little gap around the edges.
• Landscape orientation.
• 2x3 aspect ratio, if possible. (The 2x3 aspect ratio will ensure that the images are not skewed when viewed in the image library.)
• If desired, a measurement scale may be included in the image to provide a size reference.
• Use a standardized position, as this makes it much easier to compare specimens within a project. See Figure 9-2 for some common standardized animal orientations.

The following standard orientations should be adhered to when appropriate:

Dorsal
• The anterior of the specimen should be facing the top of the image frame.
• The specimen should be face-down, with the dorsal aspect of the head visible.

Lateral
• The anterior of the specimen should be facing the left side of the image frame.
• The specimen should be oriented with the feet towards the bottom of the image.

Ventral
• The anterior of the specimen should be facing the top of the image frame.
• The specimen should be face-up, with the ventral aspect of the head visible.

Figure 9-2: Suggested formats for photographs.
10. Trace File Submission Protocol
This protocol assists in the submission of trace files to BOLD. It describes the necessary format of the files and the ancillary data that is required for the correct submission.

Trace files provide support for sequences and should be uploaded for every specimen record. They can be uploaded once the data submission step is completed and BOLD has assigned a Process ID to each record.

1. Confirm primers are registered on BOLD. See New Primer Registration below for details on how to register new primers.

2. Assemble Package: The submission package consists of trace files (.ab1), corresponding phred (score) files if available (.phd.1) and a spreadsheet with the file names and ancillary data. The submission spreadsheet should be named data.xls and contain the columns described in Table 10-1.

Steps:

A. Fill in the data.xls sheet with all the data about your files. To easily create the list of the files in a folder, open a terminal window (Start > Run > cmd in Windows), navigate to the folder where the trace and score files have been placed and run one set of the following commands:

• Windows: dir /b *.ab1 > ab1.txt and dir /b *.phd.1 > phd.txt
• MacOS: ls *.ab1 > ab1.txt and ls *.phd.1 > phd.txt
• Linux/Unix: ls *.ab1 > ab1.txt and ls *.phd.1 > phd.txt

These commands will generate lists of the trace and score files in the current folder. They will be saved as ab1.txt and phd.txt text files. You can then open the text files and move the data into the appropriate columns. Obtain the Process IDs by clicking on “Data Spreadsheets” under the Downloads menu on the left side of a project console. Download the Core Lab Book to get the Process IDs that are assigned to each Sample ID.

B. These components (trace files, score files and spreadsheet) need to be placed in a single folder. Compress them all into a single zipped file before submitting. The following free tools are available to provide this functionality:
»» WinZip - http://www.winzip.com
»» WinRar - http://www.rarsoft.com
»» MacZipIt - http://www.maczipit.com

C. BOLD will accept a maximum file size of 190 MB. Upload the traces to BOLD by clicking on the “Trace Files” link in the Uploads panel of the desired project. Select the zipped folder of files and the run site†, and then hit “submit”. Do not navigate away from the project or close the pop-up window until the successful upload message is displayed.

Trace File *: Complete (including extension) and identical file name (case sensitive).
Score File: Complete (including extension) and identical file name (case sensitive).
PCR Primers Fwd/Rev *: Primer codes are case sensitive. Both must be filled in.
Sequence Primer: Primer codes are case sensitive.
Read Direction *: Forward or Reverse.
Process ID *: Process ID of record, which must match Process ID in BOLD. (2 blank columns must be left after the Process ID column.)
Marker: If sequencing multiple genes, the marker needs to be filled in to match the shortform marker in your project, such as one of the following: COI-5P, ITS, rbcLa, matK.

Table 10-1: Field definitions for accompanying spreadsheet. * Required Fields
Trace File | Score File | PCR Fwd | PCR Rev | Seq Primer | Read Direction | Process ID | blank | blank | Marker
KKBNA001-04.ab1 | KKBNA001-04.phd.1 | BirdF1 | BirdR1 | BirdF1 | Forward | KKBNA001-04 | | | COI-5P
KKBNA001-04r.ab1 | KKBNA001-04r.phd.1 | BirdF1 | BirdR1 | BirdR1 | Reverse | KKBNA001-04 | | | COI-5P
KKBNA002-04.ab1 | KKBNA002-04.phd.1 | BirdF1 | BirdR1 | BirdF1 | Forward | KKBNA002-04 | | | COI-5P
KKBNA002-04r.ab1 | KKBNA002-04r.phd.1 | BirdF1 | BirdR1 | BirdR1 | Reverse | KKBNA002-04 | | | COI-5P
KKBNA003-04.ab1 | KKBNA003-04.phd.1 | BirdF1 | BirdR1 | BirdF1 | Forward | KKBNA003-04 | | | COI-5P
KKBNA003-04r.ab1 | KKBNA003-04r.phd.1 | BirdF1 | BirdR1 | BirdR1 | Reverse | KKBNA003-04 | | | COI-5P
KKBNA004-04.ab1 | KKBNA004-04.phd.1 | BirdF1 | BirdR1 | BirdF1 | Forward | KKBNA004-04 | | | COI-5P
Figure 10-1: Trace File Submission Spreadsheet (data.xls) completed with sample data. † Attribution: In an effort to provide attribution to the labs generating trace files, BOLD has implemented a run site field for each trace upload.
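Rows like those in Figure 10-1 can be drafted from the file names before being pasted into data.xls. The sketch below is hypothetical and relies on conventions seen only in the sample data, not on any BOLD rule: trace files are assumed to be named after the Process ID, reverse reads are assumed to carry a trailing "r", and the sequencing primer is assumed to be the PCR primer of the same direction.

```python
def trace_rows(trace_files, score_files, pcr_fwd, pcr_rev, marker=""):
    """Build spreadsheet rows in the column order shown in Figure 10-1.

    Assumptions (taken from the sample data only): a trace file is named
    <Process ID>.ab1 for a forward read and <Process ID>r.ab1 for a reverse read.
    """
    scores = set(score_files)
    rows = []
    for trace in sorted(trace_files):
        if not trace.endswith(".ab1"):
            raise ValueError(f"not a trace file: {trace}")
        stem = trace[: -len(".ab1")]
        reverse = stem.endswith("r")
        process_id = stem[:-1] if reverse else stem
        score = stem + ".phd.1"
        rows.append([
            trace,
            score if score in scores else "",    # the score file is optional
            pcr_fwd,
            pcr_rev,
            pcr_rev if reverse else pcr_fwd,      # sequencing primer (assumed)
            "Reverse" if reverse else "Forward",
            process_id,
            "",                                   # two blank columns follow the Process ID
            "",
            marker,
        ])
    return rows
```

Each row should still be checked by hand, since file naming conventions differ between labs.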
Tips and Troubleshooting for Trace File Uploads
• Primers must be registered before upload. If the primers are not registered, there will be an error. See New Primer Registration below for details on how to register new primers.
• The zipped file must be under 190 MB in size. If the upload fails to initialize, it is probably because the zipped file is too large. Try breaking it into more than one upload, each with its own spreadsheet.
• The spreadsheet cannot contain any formulas.
• If the upload program cannot find the files, it is possibly because it cannot read the names. Make sure that you have text values only in the spreadsheet.
• Full filenames must be used in the spreadsheet. The extension (.ab1, .phd.1) must be included in the file name. These extensions are case sensitive.
• The spreadsheet must be named data.xls. If the upload program cannot find the spreadsheet, confirm that it is named correctly (case sensitive).
• Data must start on the second line of the spreadsheet. There is only one line for the column headers.
• Do not add extra columns to the spreadsheet.
• Trace files will not be downloadable from BOLD until 24 hours after they have been submitted.

You can upload more traces in separate batches to any record at any time. If you wish to delete any traces for a record, please contact the BOLD Support Team through support@boldsystems.org.
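The packaging step and the 190 MB limit can be combined in a short script. This is a sketch under stated assumptions rather than an official procedure: zip_package is a hypothetical helper, the limit is interpreted conservatively as 190,000,000 bytes, and the files are stored at the top level of the archive because the handbook does not say whether a containing folder is expected.

```python
import zipfile
from pathlib import Path

# Assumption: "190 MB" is read conservatively as decimal megabytes.
MAX_ZIP_BYTES = 190 * 1000 * 1000


def zip_package(folder, zip_path):
    """Zip every file in `folder` into `zip_path` and report (size_in_bytes, within_limit)."""
    folder = Path(folder)
    zip_path = Path(zip_path)
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(folder.iterdir()):
            # Skip sub-folders and the archive itself if it lives in the same folder.
            if path.is_file() and path.resolve() != zip_path.resolve():
                archive.write(path, arcname=path.name)
    size = zip_path.stat().st_size
    return size, size <= MAX_ZIP_BYTES
```

If the second value returned is False, split the files into several folders, each with its own spreadsheet, and package them separately.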
New Primer Registration

Be sure that your primers are registered with BOLD before assembling the submission package. To register your primers, select “Register Primers” from the Project Options menu in your project on BOLD or from the primer list page.

Please note: If the primer sequence you used has already been registered under a different code, you will be provided with the registered code to be used in your submission. Primers you own can be edited at any time after they are created (e.g. to make them public).
Primer Code: Create a code for your primer. If the primer is already published in a manuscript, please use the code that is in press.
Primer Description: A description of what the primer is used for.
Alias Codes: Fill in any other known code names for your primer, separated by commas.
Target Marker: Select the target marker from the controlled list (e.g. ITS, COI 5’, matK, etc.).
Cocktail Primer: Select “Yes” if it is a cocktail primer. This will create extra fields to add multiple sequences.
Primer Sequence: Fill in the sequence(s), 5’ to 3’.
Direction: Select the direction of the sequence.
Reference/Citation: List references and/or citations.
Notes: Notes about the primer.
Publicly Available: If the primer has already been published, or if you wish to make it publicly available, this should be left public. The other option is to keep the primer private until publication.

Table 10-2: Field definitions for Primer Submission Form.

Figure 10-2: BOLD Primer submission form
11. Sequence Submission Protocol
This protocol outlines the DNA sequence submission process on BOLD. It describes the sequence format and steps required for a successful submission.

1. Assemble Package: The sequence submission package should consist of aligned sequences in FASTA format referenced by BOLD Process IDs or your Sample IDs.

To upload with Process IDs, the FASTA header line must begin with a ‘>’ followed by the Process ID, with any additional information separated by either a bar (‘|’), an underscore (‘_’) or a space (‘ ’). There can be no spaces before the end of the Process ID.

To upload with Sample IDs, the FASTA header line must begin with a ‘>’ followed by the Sample ID, with any additional information separated by a bar (‘|’). Do not use a space or an underscore to separate information from the Sample ID.

2. Upload Package: You can include up to 1000 sequences in one upload. Upload the sequences to BOLD by clicking on the “Sequences” link in the Uploads menu of the desired project. Select the marker and sequencing institution†. Paste the sequences into the text box. Before you upload the sequences, you can use the “Preview Tree” button to check them. A glance at the Neighbour Joining Tree provides an opportunity to diagnose contaminations and sample mix-ups before uploading. When confirmed, click on “submit”.

»» If you wish to replace a sequence on BOLD, simply upload the new one with the same Process ID or Sample ID.
»» If you wish to delete a sequence on BOLD, simply upload “NNNNN” associated with the Process ID or Sample ID. Alternatively, you can delete an individual sequence using the Delete button within a record’s sequence data page (for more information, see Sequence Data Page in Section 12).
Figure 11-1: Sequence Upload Window
Figure 11-2: Preview Tree

Example of Sequences in FASTA Format:

>TZBNA001-05|species name|region
CTGCAGGANCAAAAAATGAAGTATTTAAATTTCGATCTGTTAATAATATAGTAATAGCTCCTGCTAATACAGGTAAAGATAATAATAATAAAAAAGCTGTAATTCCTACAGCTCAAACGAAAAGGGGTAGTTGATCGAAAAATATATTATTTAATCGTATATTAATAATAGTTGTAATAAAATTAATTGCTCCTAAAATAGAAGAA
>TZBNA002-05
CAGCTAATACGGGTAAAGATAATAATAATAAAAAAGCTGTAATTCCTACTGCCCAAACAAAAAGAGGTAATTGATCAAAAAATATATTATTTAAGCGTATATTAATAATAGTTGTAATAAAATTAATTGCCCCTAAAATAGAAGAAATTCCTGCTAAATGAAGAGAAAAAATAGCTAAATCTACAGAACTACCCCCATGGGCGATATTAGAAGATAATGGGGGGTAGACTGTTCATCCTGTT
>TZBNA012-05
AAAATAGCTAAATCAACTGAGCTTCCTCCATGAGCAATATTAGATGATAGTGGGGGGTAAACTGTTCATCCTGTTCCAGCTCCATTTTCTACCACTCTTCTTGAAATTAAAAGAGTAATAGAAGGGGGGAGTAATCAAAATCTTATATTATTTATTCGTGGGAAAGCN
† Attribution: In an effort to provide attribution to the labs generating sequences, BOLD has implemented a run site field for each sequence upload.
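The header rules described in this protocol can be illustrated with a small parser. This is a minimal sketch and not BOLD's own parser: parse_fasta and its id_type parameter are hypothetical names, and the function only applies the separator rules and the 1000-sequence limit stated above.

```python
import re

MAX_SEQUENCES = 1000  # upload limit stated in the protocol


def parse_fasta(text, id_type="process"):
    """Return a list of (record_id, sequence) pairs from FASTA text.

    id_type="process": the ID ends at the first bar, underscore or space.
    id_type="sample":  the ID ends at the first bar only, so it may contain spaces.
    """
    if id_type == "process":
        separator = r"[|_ ]"
    elif id_type == "sample":
        separator = r"\|"
    else:
        raise ValueError("id_type must be 'process' or 'sample'")

    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            record_id = re.split(separator, line[1:], maxsplit=1)[0]
            records.append([record_id, ""])
        elif records:
            records[-1][1] += line  # sequences may span several lines
    if len(records) > MAX_SEQUENCES:
        raise ValueError(f"{len(records)} sequences exceed the limit of {MAX_SEQUENCES}")
    return [(record_id, sequence) for record_id, sequence in records]
```

Applied to the example above, the first header yields the Process ID TZBNA001-05, and the text after the first bar is ignored. A Sample ID such as "ROM 10912" survives intact only when id_type="sample" is used, which is why spaces and underscores cannot act as separators for Sample IDs.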
12. Project Summary

Once your project has been populated with the specimen data, images, traces and sequences that you have uploaded to BOLD, it will resemble Figure 12-1.

Project Console
The project console is the main summary of a project, which reports on progress and lists all actions and analyses open to a project member. The console displays a report of the number of specimens, along with tallies of any missing components of the records. Also included are graphs to provide a quick visual overview of the project, as well as a list of all the users with access to the project. The links to the left provide access to uploads, downloads and various analysis tools.

Project Managers will see the “Modify Project Properties” button, with which they can change the project title and description, add or remove markers, and add, remove or modify permissions of users at any time. The Project Manager can also publish the records in the project to GenBank. (See Section 13, GenBank Submission, for more details.)

To access a list of the records within each project, click on “View All Records” in the project options menu.
Figure 12-1: Project Console
Record List
A Project Record List is the full list of all records within a project, along with the actions and tools open to a project member. The record list gives access to individual specimen and sequence data for each record. You can select specific records for analysis or downloads using the checkboxes.

Flags
• Icons appearing next to a record indicate the presence of certain characteristics of a record; see Table 12-1 for more details.
• A red-highlighted, bolded sequence length is a warning that the sequence contains more than 1% ambiguity and won’t meet the Barcode Standard.

The red arrows along the column headers can be used to sort the records by header. The Project Manager or a user with full edit permissions can move records from one project to another by selecting the records needed and then clicking on “Move records to Another Project” in the Options menu. The destination projects that appear in the list are those for which the user has full permissions. Click on the Sample ID or the Process ID to access the Specimen Data or Sequence Data page, respectively, for each record. These pages are described under Specimen Data Page and Sequence Data Page below.
Figure 12-2: Record List
• GPS coordinates present for sample
• Images present for sample
• The number of traces present
• Stop codons present in sequence
• Contamination present in sequence
• Flagged record, not in ID engine
Table 12-1: BOLD Record List icon legend
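The 1% ambiguity warning mentioned under Flags can be approximated before upload. The sketch below is an assumption-laden illustration, not BOLD's actual rule: it treats every character other than A, C, G or T as ambiguous and leaves alignment gaps ("-") out of the count, and the function names are hypothetical.

```python
def ambiguity_fraction(sequence):
    """Fraction of non-gap characters that are not A, C, G or T (assumed definition)."""
    bases = [base for base in sequence.upper() if base != "-"]
    if not bases:
        return 0.0
    ambiguous = sum(1 for base in bases if base not in "ACGT")
    return ambiguous / len(bases)


def exceeds_ambiguity_limit(sequence, limit=0.01):
    """True when more than `limit` (1% by default) of the bases are ambiguous."""
    return ambiguity_fraction(sequence) > limit
```

A sequence of 100 bases with two N characters would be reported as exceeding the limit, while a sequence without ambiguous bases would not.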
Specimen Data Page
This page provides voucher details, taxonomy, specimen details and collection data for a specimen. Any user with specimen editing permissions can edit the records by selecting “Edit Specimen” from the upper right corner. Selecting the species name will open the appropriate taxonomy browser page for that taxon. There is a world map marked with the location where the specimen was collected. By clicking on the map, you can access the distribution mapping options for that species. (See Section 16, Distribution Maps, for more details.) The images for the specimen are located at the bottom of the window. By clicking on an image, you will access a larger copy of it.
Sequence Data Page

This page gives access to details about the sequence data for a specimen. Different markers can be selected by clicking on their links in the marker section. Trace files can be viewed or downloaded from this window. Primers used can be viewed by clicking on the primer codes. Sequences can also be deleted by users with full sequence access. If desired, the ID engine can be used to identify the sequence. (See Section 18, Identification Engine, for more details.) An illustrative barcode sequence of the species is provided, along with a link to the Laboratory Information Management System (LIMS) for the Canadian Centre for DNA Barcoding when available.
Figure 12-4: Sequence Data Page
Figure 12-3: Specimen Data Page
13. GenBank Submission

BOLD has an automatic submission tool for Project Managers to publish sequences to GenBank. All records within the opened project will be submitted to GenBank. If only a subset of records needs to go to GenBank, those records should be moved to a new project, which can then be submitted. GenBank accession numbers are generally returned by email to the project manager within five business days. The accession numbers can then be associated on BOLD with your records for quick reference.

Once your paper is published, you need to make your BOLD project public. The Project Manager can do this by clicking on “Modify Project Properties” and checking off the box that says “Make this project publicly visible”. Please also send the citation and a link to the paper on the journal’s website to the BOLD Support Team through support@boldsystems.org so that it can be attached to your project (or multiple projects if all were used for one paper) and included in our publication database.

Submissions to GenBank are locked for 1 year to allow time for publication. If the user wants the sequences published on GenBank sooner, they can contact GenBank directly to request public release.
Figure 13-1: GenBank Submission Form
14. Bibliographies

Users can add bibliographies to BOLD using the Bibliography Submission Form, available in three locations:
• On the list of BOLD projects
• Within a project
• Within a record list page (allows you to first select the records to populate the Primary Associated Records field)

Any user with edit sequence or edit specimen permissions to records will have the ability to submit a bibliography connected to those records. A bibliography page will be available from the individual record pages once submitted.
Article Title *: The name of the article/paper.
Authors *: A list of the authors of the article.
Journal *: The name of the journal that the article is published in.
Journal Details: Fill in year*, volume, issue and the page range.
Abstract *: Fill in the official abstract.
Dates: Date published*, date received, date revised and date accepted.
PubMed Info: The PubMed ID (PMID) and PubMed Central ID (PMCID).
DOI: Unique Digital Object Identifier.
URL: The URL of the paper from the journal’s site.
Language: Language(s) the paper is written in, separated by a comma.
Open Access: Check here if the published paper is openly accessible.
Keywords: List all keywords for the paper, separated by a comma.
Associated Records: The GenBank accessions of the primary and secondary records can be filled in here, separated by a line. If the user submitting this bibliography does not have edit access to the primary records on BOLD, then they won’t be able to upload the linked publication.
Figure 14-1: Bibliography Submission Form
Table 14-1: Field definitions for Bibliography Submission Form. * Required Fields
Figure 14-2: Bibliography
Analytical Tools Introduction

BOLD includes core and extended tools to analyze specimen and sequence data:

Core Analysis Tools
• Image Library: Compare morphological characteristics
• Distribution Maps: Interact with geographical data
• Taxon ID Tree: Visualize a neighbour joining tree with matching images
• Identification Engine: Locate closest matches to an unknown sequence
• Barcode Index Numbers (BINs): View Barcode clusters

Extended Analysis Tools
• Distance Summary: Browse sequence divergence at multiple taxonomic levels
• Sequence Composition: Explore compositional variation at all codon positions
• Nearest Neighbour Summary: Evaluate the Barcode gap
• DNA Degradation Test: Equate sequence quality with specimen age
• Accumulation Curve: Review sampling efficiency
• Alignment Viewer: Diagnose unaligned sequences

When the “Expand” icon appears next to a graph, the graph can be expanded to a better quality version that can be used in publications.
15. Image Library

Once images have been uploaded to your project, you can view them in two ways. The first is by opening an individual record, where any corresponding images are displayed below the specimen data. The second is the Image Library for viewing a group of specimens, shown in Figure 15-1. The Image Library allows you to sort by orientation so you can compare morphological differences between specimens. This tool is useful for diagnosing contamination or misidentifications, as taxonomy is displayed below each image.
Figure 15-1: Example of the Image Library (Saturniidae)
16. Distribution Maps

BOLD’s Distribution Maps plot the collection points for a selected set of specimens when geographic reference data is available. The multiple mapping tools available on BOLD are described below.

Quick Map
The Quick Map is built using imagery from the NASA Blue Marble Project. These images are open access and therefore can be reused and modified at the user’s discretion. Users can zoom in to a maximum of 1 km per pixel by using the scale at the bottom or by clicking on a region in the map, and can pan the map in 4 directions by clicking on the N/E/S/W directions in the frame. The collection points are shown on the map using markers specifying the density of sampling at each point. Larger markers are placed beneath smaller ones so that all points are visible.

Figure 16-1: Quick Map

Multi-Layer Map
The Multi-Layer Map is based on Google maps. Google gives permission for use in publications as long as the Google logo remains on the image. The layers include political boundaries with regional names, as well as a satellite map of the world. These can be viewed individually or combined, as shown in Figure 16-2. The markers on this map are active and can be clicked to retrieve a list of the specimens from BOLD. This record list is also active, meaning users can open specimen records, which can be edited directly. This is a great way to validate and correct GPS data.

Figure 16-2: Multi-Layer map with hybrid view

Google Earth
The Google Earth map is a display of the specimen collection points in the program Google Earth. This is free software that can be downloaded from the web. Google gives permission for use in publications as long as the Google logo remains on the image. The benefit of this type of map is that it provides a portable KML file download which can be shared among colleagues. This map is embedded with specimen images, along with specimen identifiers, country, province/state, institution/collection, and extra info.

Figure 16-3: Google Earth
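For readers unfamiliar with KML, the following sketch shows roughly what a minimal file of collection points looks like. It does not reproduce the file that BOLD generates, whose exact contents are not described here: build_kml is a hypothetical function and the sample point is invented. The only format facts relied on are the KML 2.2 namespace and the longitude,latitude order of the coordinates element.

```python
import xml.etree.ElementTree as ET

KML_NS = "http://www.opengis.net/kml/2.2"


def build_kml(points):
    """Return a KML document as text with one placemark per (label, latitude, longitude)."""
    ET.register_namespace("", KML_NS)  # write KML as the default namespace

    def tag(name):
        return f"{{{KML_NS}}}{name}"

    kml = ET.Element(tag("kml"))
    document = ET.SubElement(kml, tag("Document"))
    for label, latitude, longitude in points:
        placemark = ET.SubElement(document, tag("Placemark"))
        ET.SubElement(placemark, tag("name")).text = label
        point = ET.SubElement(placemark, tag("Point"))
        # KML expects longitude first, then latitude.
        ET.SubElement(point, tag("coordinates")).text = f"{longitude},{latitude}"
    return ET.tostring(kml, encoding="unicode")


# Invented example point, for illustration only.
print(build_kml([("sample-1", 43.66, -79.39)]))
```

Saving the returned text with a .kml extension produces a file that Google Earth can open.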
17. Taxon ID Tree
This section describes how to use the Taxon ID Tree. The user can access this feature by clicking on “Taxon ID Tree” under the Sequence Analysis panel on the project console. If the tree is only desired for a selection of records, it may be accessed from the Project Record List. Below are examples of what the tree types will look like. The user can view and save a modifiable PDF of the tree, export the tree to Newick format and view a taxonomy report, as well as view the matching image library and the accompanying data spreadsheet. Once a tree is built, it can be used to compare specimens. Users can identify specimens, as well as catch misidentifications and contaminations.
Sequence Data: Nucleotide
Distance Model:
• Kimura 2 Parameter
• Jukes Cantor
• Pairwise Distance
Tree Building Method: Neighbor Joining is the only method at this time.
Alignment Options: Option to allow BOLD to align sequences automatically.
Select Terminal Branch Labels: Many options for labels to add to the end of each branch.
Photographs: Option to include matching specimen photographs and spreadsheet for comparison.
Codon Positions Included: 1st, 2nd and 3rd Codon Positions may be included.
Apply Filters: Can be applied to disregard sequences below a given length (since very short sequences can skew the results) or to analyze only sequences with less than 1% ambiguous bases.
Colorize Tree Based on:
• Taxonomy: Class
• Taxonomy: Order
• Taxonomy: Family
• Taxonomy: Subfamily
• Location: Country
• Extra Info
• Sequence Age
Table 17-1: Parameters available for Taxon ID Tree
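The three distance models listed in Table 17-1 can be written out using their standard textbook formulas. The sketch below is not BOLD's implementation: the function names are hypothetical, the sequences are assumed to be aligned already, and sites where either sequence has a gap or an ambiguous base are skipped, which is an assumption.

```python
import math

TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def site_proportions(seq1, seq2):
    """Return (P, Q): proportions of transitions and transversions between two aligned sequences."""
    pairs = [
        (a, b)
        for a, b in zip(seq1.upper(), seq2.upper())
        if a in "ACGT" and b in "ACGT"  # skip gaps and ambiguous bases (assumption)
    ]
    if not pairs:
        raise ValueError("no comparable sites")
    transitions = sum(1 for pair in pairs if pair in TRANSITIONS)
    transversions = sum(1 for a, b in pairs if a != b and (a, b) not in TRANSITIONS)
    return transitions / len(pairs), transversions / len(pairs)


def pairwise_distance(seq1, seq2):
    """Uncorrected proportion of differing sites (p-distance)."""
    p, q = site_proportions(seq1, seq2)
    return p + q


def jukes_cantor(seq1, seq2):
    """Jukes-Cantor distance: d = -3/4 * ln(1 - 4p/3)."""
    p = pairwise_distance(seq1, seq2)
    return -0.75 * math.log(1 - 4 * p / 3)


def kimura_2p(seq1, seq2):
    """Kimura 2 Parameter distance: d = -1/2 * ln((1 - 2P - Q) * sqrt(1 - 2Q))."""
    p, q = site_proportions(seq1, seq2)
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))
```

For closely related sequences the three values are nearly identical; the corrections matter more as divergence grows, and the logarithms become undefined for very divergent pairs.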
Figure 17-1: BOLD Tree Parameter Page

Figure 17-2: Standard Taxon ID Tree (Lepidoptera)

Figure 17-3: Circle Tree expected by the end of 2009
18. Identification Engine

The library of sequences collected in BOLD is available for facilitating identification of unknown sequences. The ID engine uses all sequences uploaded to BOLD from private, as well as public projects to locate the closest match. To protect BOLD users, no sequence information from private records is exposed.

Animal Identification (COI)
The BOLD Identification System (IDS) accepts sequences from the 5’ region of the mitochondrial gene COI and returns a species-level identification when possible. Further validation with independent genetic markers is desirable in some forensic applications. BOLD uses the BLAST algorithm to identify single base indels before aligning the protein translation through profile to a Hidden Markov Model of the COI protein.

There are four databases within BOLD for use in identification of COI sequences:

1. All Barcode Records Database includes: Every COI barcode record on BOLD with a minimum sequence length of 500bp (Warning: This is an un-validated database and includes records without species level identification). This includes many species represented by only one or two specimens, as well as all species with interim taxonomy. This search only returns a list of the nearest matches and does not provide a probability of placement to a taxon.
Figure 18-1: Identification Engine Results Page for COI
2. Species Level Barcode Database includes: Every COI barcode record with a species level identification and a minimum sequence length of 500bp (Warning: This is an unvalidated dataset). This includes many species represented by only one or two specimens, and all species with interim taxonomy.
Figure 18-2: COI Identification Engine Tree Results

3. Public Record Barcode Database includes: All published COI records from BOLD and GenBank with a minimum sequence length of 500bp. This library is a collection of records from the published projects section of BOLD.

4. Full Length Record Barcode Database includes: A subset of the Species library with a minimum sequence length of 640bp and containing both public and private records. This library is intended for short sequence identification as it provides maximum overlap with short reads from the barcode region of COI.

Fungal (ITS) and Plant (rbcL & matK) Identification
In the BOLD Identification System (IDS), ITS is the default identification tool for fungal barcodes and rbcL and matK are the defaults for plant barcodes. Both return a species-level identification when possible. Further validation with independent genetic markers will be desirable in some forensic applications. The BLAST algorithm is employed in place of BOLD’s internal identification engine for these sequences. There are relatively few fungal and plant records on BOLD, so most queries will likely not return a successful match. This will improve as sampling efforts continue in these kingdoms. These databases include many species represented by only one or two specimens, as well as all species with interim taxonomy. Both searches only return a list of the nearest matches and do not provide a probability of placement to a taxon.

Fungal Database includes: Every ITS barcode record on BOLD with a minimum sequence length of 100bp (Warning: This is an un-validated database that includes records without species level identification).

Plant Database includes: Every rbcL and matK barcode record on BOLD with a minimum sequence length of 500bp (Warning: This is an un-validated database that includes records without species level identification).
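The inclusion criteria of the four COI databases described in this section can be summarized as a small decision function. This is a simplified reading of the text, not BOLD's logic: coi_databases is a hypothetical name, the three inputs are the only criteria considered, and other conditions (for example flagged records) are ignored.

```python
def coi_databases(length_bp, has_species_id, is_public):
    """List the COI identification databases a record would fall into, per the criteria above."""
    databases = []
    if length_bp >= 500:
        databases.append("All Barcode Records")
        if has_species_id:
            databases.append("Species Level Barcode")
        if is_public:
            databases.append("Public Record Barcode")
    # Described as a subset of the Species library with a 640bp minimum length.
    if length_bp >= 640 and has_species_id:
        databases.append("Full Length Record Barcode")
    return databases
```

For example, a private 658bp record with a species-level identification would fall into the All Barcode Records, Species Level and Full Length databases, while a 450bp record would fall into none of them.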
19. Barcode Index Number System (BINs)

The Barcode Index Number system (BINs) is an alternate method for species identification that makes use of a novel annotation tagging system for barcode clusters generated by a graph-theoretic clustering technique introduced by Ratnasingham and Hebert (In prep, 2009). This system consists of two parts:
• A novel clustering algorithm employing graph methods to generate operational taxonomic units (OTUs) and putative species from sequence data without prior taxonomic information,
• A curated registry of barcode clusters integrated with an online database of specimen and taxonomic data with support for community annotations.

The BIN framework can greatly expedite the evaluation and annotation of described species and putative new ones while reducing the need to generate interim names, a non-trivial issue in barcoding datasets. For example, nearly half of the species names on BOLD are interim and their number is growing at a rapid rate. The BIN algorithm has been effectively tested on a broad set of taxonomic groups and shows potential for applications in species abundance studies and environmental barcoding. The registry employs modern GUID and web service functionality, enabling integration with other databases.

BINs are viewable on both the specimen pages and sequence pages for each record that has enough relevant data to be incorporated in the BIN system. Each has a link to the BIN summary page and taxonomy browser pages for species names.

Figure 19-1: Depiction of BIN AAA1568 summary page
Figure 19-3: Depiction of BIN information on Sequence Page
Figure 19-2: Depiction of BIN information on Specimen Page
*The BIN system is currently undergoing beta testing and is expected to be deployed by the end of 2009.
20. Distance Summary
It is desirable for barcodes to show very low sequence divergence within a species, with significantly higher sequence divergence at higher taxonomic levels. The Distance Summary tool gives a report of sequence divergence between barcode sequences at the level of species, genus, family, order, and class. Distance values are calculated using the Kimura 2-parameter method. Comparisons are performed at each of the given taxonomic levels, and the frequency of the resulting distances is plotted as shown in Figure 20-1. Details for the comparisons done at the level of species, genus, and family are available by clicking on the links in the top right corner of the Distance Summary browser.
Figure 20-1: BOLD Distance Summary
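For readers unfamiliar with the Kimura 2-parameter distance, the following Python sketch shows the standard formula applied to one pair of aligned sequences. It is an illustration only, not BOLD's implementation: the function name and the choice to skip gaps and ambiguous bases are assumptions, and the handbook does not state how BOLD treats such sites.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(seq_a, seq_b):
    """Kimura 2-parameter distance between two aligned DNA sequences.

    Sites where either sequence has a gap or an ambiguous base are skipped
    (an assumption made for this sketch).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")

    compared = transitions = transversions = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # gap or ambiguity code: not comparable
        compared += 1
        if a == b:
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1      # A<->G or C<->T
        else:
            transversions += 1    # purine <-> pyrimidine

    if compared == 0:
        raise ValueError("no comparable sites")

    p = transitions / compared    # proportion of transitions
    q = transversions / compared  # proportion of transversions
    # d = -1/2 * ln((1 - 2P - Q) * sqrt(1 - 2Q))
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))
```

The distance is undefined when the sequences are so divergent that the logarithm's argument is not positive; in that case `math.log` or `math.sqrt` raises `ValueError`.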
21. Sequence Composition
The frequency of DNA bases, with emphasis on GC-content, can be a useful metric for evolutionary biologists. GC-content within the barcoding region of COI has been correlated with GC-content of the entire mitochondrial genome for many species. The Sequence Composition tool allows the user to view the frequency of each base (G, C, A and T), as well as graphics for GC content at all codon positions. This information includes overall sequence composition, as well as composition for codon positions 1, 2, and 3. "Detailed View" tabulates the results for each specimen.
Figure 21-1: BOLD Sequence Composition
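The quantities reported by this tool can be illustrated with a short Python sketch that computes base frequencies and GC content per codon position for a single sequence. The function names and the `frame_offset` parameter are hypothetical, and the sketch assumes the sequence is in a known reading frame; it does not reproduce BOLD's own calculation.

```python
from collections import Counter


def base_frequencies(seq):
    """Return the relative frequency of A, C, G and T, ignoring other symbols."""
    counts = Counter(b for b in seq.upper() if b in "ACGT")
    total = sum(counts.values())
    return {b: (counts[b] / total if total else 0.0) for b in "ACGT"}


def gc_by_codon_position(seq, frame_offset=0):
    """Return the GC fraction at codon positions 1, 2 and 3.

    frame_offset is the index of the first base of the first complete codon
    (an assumption of this sketch; the caller must know the reading frame).
    """
    seq = seq.upper()[frame_offset:]
    result = {}
    for pos in range(3):
        # Every third base starting at this codon position; non-ACGT symbols
        # are dropped after slicing so they do not shift the frame.
        bases = [b for b in seq[pos::3] if b in "ACGT"]
        gc = sum(1 for b in bases if b in "GC")
        result[pos + 1] = gc / len(bases) if bases else 0.0
    return result
```

For example, `gc_by_codon_position("ATGGCCATT")` treats the input as the codons ATG, GCC and ATT and reports the GC fraction of the first, second and third bases of those codons separately.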
22. Nearest Neighbour Summary
The Nearest Neighbour Summary reports the distance to the nearest neighbour for each of the species in the list of specimens. Distances are highlighted if the nearest neighbour is less than 2% divergent, or when the distance is less than the intraspecific distance. Warnings presented by this tool may be summarized by clicking on the link in the top right corner of the Nearest Neighbour browser. Also, the link to "Summarize by Family" combines the results for each taxonomic family in the project.
Figure 22-1: BOLD Nearest Neighbour Summary
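The two highlighting rules described above can be expressed as a small Python sketch. The input layout, the function name, and the species names in the usage example are hypothetical; the sketch also assumes that "intraspecific distance" means the maximum intraspecific distance, which the handbook does not specify.

```python
def flag_nearest_neighbours(records, threshold=2.0):
    """Flag species whose nearest-neighbour distance looks suspicious.

    records maps a species name to a pair
    (max_intraspecific_distance, nearest_neighbour_distance), both in percent.
    Returns a list of (species, reasons) for every flagged species.
    """
    flagged = []
    for species, (max_intra, nn_dist) in records.items():
        reasons = []
        if nn_dist < threshold:
            reasons.append("nearest neighbour under threshold")
        if nn_dist < max_intra:
            reasons.append("nearest neighbour closer than intraspecific distance")
        if reasons:
            flagged.append((species, reasons))
    return flagged


# Usage with made-up distances (percent divergence).
example = {
    "Species A": (1.2, 0.8),  # fails both rules
    "Species B": (0.5, 4.1),  # passes both rules
    "Species C": (3.0, 2.5),  # above 2% but closer than its own variation
}
for name, why in flag_nearest_neighbours(example):
    print(name, "->", "; ".join(why))
```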
23. DNA Degradation Test
By examining the age of a set of specimens vs. the length of the sequences retrieved, one can determine whether the age of the specimen has any correlation with the quality of the sequence. This comparison may show trends that will help users refine future barcoding, specimen storage and laboratory protocols.
Figure 23-1: BOLD DNA Degradation Test
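One simple way to quantify the relationship described above is a Pearson correlation between specimen age and sequence length. The Python sketch below is an assumption about how such a check could be done outside BOLD; the handbook does not state which statistic, if any, the tool computes, and the function name is hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when one sample is constant")
    return sxy / math.sqrt(sxx * syy)
```

A value near -1 for specimen age (in years) against sequence length (in bp) would suggest that older specimens yield shorter sequences; a value near 0 would suggest no linear relationship.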
24. Accumulation Curve
An accumulation curve of standardized DNA barcodes and related features provides a clear, transparent and reproducible estimate of the diversity and sampling efficiency of areas or collections. Should one wish to characterize a region's invertebrate fauna, it is clear that we need to accelerate the sampling process if we are to understand how well we have sampled the community. This tool also allows users to quickly compare sampling efficiency across multiple regions and at multiple taxonomic levels.
Figure 24-1: BOLD Accumulation Curve
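The idea behind an accumulation curve can be sketched in Python as follows: count the distinct taxa seen as specimens are added one at a time, and average over random sampling orders to smooth the curve. This is a generic illustration, not BOLD's method; the function names, the number of permutations, and the fixed seed are all assumptions.

```python
import random


def accumulation_curve(labels):
    """Cumulative number of distinct labels after each specimen, in the given order."""
    seen = set()
    curve = []
    for label in labels:
        seen.add(label)
        curve.append(len(seen))
    return curve


def mean_accumulation_curve(labels, permutations=100, seed=0):
    """Average the accumulation curve over random specimen orders."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    shuffled = list(labels)
    totals = [0] * len(shuffled)
    for _ in range(permutations):
        rng.shuffle(shuffled)
        for i, value in enumerate(accumulation_curve(shuffled)):
            totals[i] += value
    return [t / permutations for t in totals]
```

A curve that flattens as specimens are added suggests that most of the taxa present have been sampled, while a curve that is still rising suggests that further sampling will keep finding new taxa.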
25. Alignment Viewer
Managing sequence alignments and base calls is a critical step in any barcode analysis. To avoid the inconvenience of importing sequences into third-party software for analysis and editing, BOLD now provides an integrated alignment browser that will include many features popular in other packages. This interface will be extended to support common editing functions and novel visualization features.
Figure 25-1: BOLD Alignment Viewer
26. Web Services
BOLD Web Services provide the ability to search and retrieve public data from BOLD. These services are built on the Representational State Transfer (REST) architecture, which was chosen for the following reasons:
• Scalability of component interactions;
• Generality of interfaces;
• Independent deployment of components;
• Intermediary components to reduce latency, enforce security and encapsulate legacy systems.
Services currently offered by BOLD are e-Search and e-Fetch.
The services home page can be reached through: http://services.boldsystems.org/ Examples of e-Search and e-Fetch are available on the Services home page.
Parameter     Accepted values
id_type       sampleid, processid, specimenid, sequenceid, taxid, recordid
ids           (comma-separated IDs)
return_type   text, xml, json
file_type     zip
geo_inc       (comma-separated countries/provinces to be included)
geo_exc       (comma-separated countries/provinces to be excluded)
taxon_inc     (comma-separated taxonomy to be included)
taxon_exc     (comma-separated taxonomy to be excluded)
Table 26-1: e-Search parameter descriptions
Parameter     Accepted values
record_type   specimen, sequence, full
id_type       sampleid, processid, specimenid, sequenceid, taxid, recordid
ids           (comma-separated IDs OR * for all records)
return_type   text, xml, json
file_type     zip
Table 26-2: e-Fetch parameter descriptions
[Diagram labels: Search Terms, Returned Search Results, Filter, Retrieved Records (Client Side); BOLD eSearch Service, BOLD eFetch Service (BOLD side)]
Figure 26-1: Diagram of BOLD Web Services Workflow
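As a rough illustration of how a client might assemble requests from the parameters in Tables 26-1 and 26-2, the Python sketch below builds query URLs without sending them. Only the base URL and the parameter names come from this handbook. The endpoint names `eSearch.php` and `eFetch.php`, the function name, and the process IDs in the usage example are assumptions; consult the examples on the Services home page for the actual request format.

```python
from urllib.parse import urlencode

BASE_URL = "http://services.boldsystems.org/"  # services home page from the handbook

# Parameter names taken from Tables 26-1 and 26-2.
ESEARCH_PARAMS = {"id_type", "ids", "return_type", "file_type",
                  "geo_inc", "geo_exc", "taxon_inc", "taxon_exc"}
EFETCH_PARAMS = {"record_type", "id_type", "ids", "return_type", "file_type"}


def build_query_url(endpoint, params, allowed):
    """Build a request URL, rejecting parameters not listed for the service.

    endpoint is a hypothetical script name appended to BASE_URL.
    List or tuple values are joined with commas, as the tables describe.
    """
    unknown = set(params) - allowed
    if unknown:
        raise ValueError(f"unsupported parameters: {sorted(unknown)}")
    flat = {}
    for name, value in params.items():
        if isinstance(value, (list, tuple)):
            value = ",".join(value)
        flat[name] = value
    # safe="," keeps the comma separators readable instead of encoding them as %2C
    return BASE_URL + endpoint + "?" + urlencode(flat, safe=",")


# Usage: search by taxonomy, then fetch full records for two made-up process IDs.
search_url = build_query_url(
    "eSearch.php",  # assumed endpoint name
    {"taxon_inc": "Lepidoptera", "geo_inc": "Canada", "return_type": "xml"},
    ESEARCH_PARAMS,
)
fetch_url = build_query_url(
    "eFetch.php",  # assumed endpoint name
    {"record_type": "full", "id_type": "processid",
     "ids": ["ABC123-09", "ABC124-09"], "return_type": "xml"},
    EFETCH_PARAMS,
)
print(search_url)
print(fetch_url)
```

Validating parameter names against the two tables before building the URL catches mistakes such as passing `record_type` to e-Search, which the tables list only for e-Fetch.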
Last modified: November 2009
For online version, please visit: www.boldsystems.org/docs/
For support with any feature of BOLD, please email: support@boldsystems.org
Biodiversity Institute of Ontario, University of Guelph, 50 Stone Road East, Guelph, Ontario, Canada N1G 2W1
Copyright ©2009 Biodiversity Institute of Ontario
Authored by: Megan Milton (mmilton@boldsystems.org), Taika von Königslöw (tvonkoni@boldsystems.org), Sujeevan Ratnasingham (sratnasi@boldsystems.org)
Designed by: Suzanne Bateson, Biodiversity Institute of Ontario