
Integrative Associative Classification for Cancer Biomarker Discovery


Integrative analysis of microarray data with prior biological knowledge is a promising approach to discover reliable and accurate cancer biomarkers. Associative classification is widely used in data mining and has great potential in cancer biomarker discovery for identifying associated genes with interpretable biological information.


Ong Huey Fang, Lecturer, School of Information Technology, Monash University

Biomarkers, or biological markers, cover a wide range of substances that can be measured from body tissues, cells, blood or fluids. For instance, a cell expresses genes when they are required for biological processes, and the measurement of gene expression under different physiological conditions provides essential clues to gene function. Therefore, biomarkers play essential roles in understanding complex biological mechanisms, as well as in the diagnosis, prognosis and treatment of diseases such as cancer. The desirable characteristics of ideal cancer biomarkers are that they are non-invasive, low cost, simple to perform, discriminative and informative, and produce high accuracy, sensitivity and specificity in clinical applications. Nevertheless, achieving these ideal characteristics remains a challenge in cancer biomarker discovery. Low specificity is the situation in which a test yields false-positive results, causing unnecessary anxiety and treatment for a patient, while low sensitivity is the situation in which a test yields false-negative results, giving a patient a false sense of security. Figure 1 shows the possible applications of cancer biomarkers and their respective roles.

Figure 1. Applications of cancer biomarkers and their respective roles:
• Screening: detect a cancer at its early stage, when there are no symptoms
• Diagnosis: identify a cancer from its signs and symptoms
• Staging: determine the extent a cancer has spread within the body
• Prognosis: assess possible outcomes of a cancer, such as chances of survival, responses to treatment, and the likelihood of recurrence
• Prediction: predict responses to different treatments
• Monitoring: monitor cancer recurrence and therapeutic responses

Biomarker discovery is the process of identifying and measuring the intrinsic features of high-throughput molecular profiling data, such as microarray data, also known as gene expression data. Microarray data analysis is a powerful preclinical exploratory study for discovering potentially useful biomarkers. A microarray is an ordered collection of biological materials printed onto a small solid substrate such as a membrane or glass slide. The most common type of array is the DNA microarray, a glass slide with thousands of spots or probes; each fixed spot contains identical DNA molecules that correspond to a gene. Microarray data are used to analyse gene expression within a single sample or to compare gene expression between two different cell types or tissue samples, such as between healthy (normal condition) and diseased (test condition) tissues. Small in size yet covering a large number of genes, the microarray has become an indispensable tool for assaying the expression levels of up to thousands of genes simultaneously in a single experiment.

Some of the well-known microarray platforms include Affymetrix GeneChip, Agilent, and Illumina BeadChip. Although a microarray platform's design is limited to known genes, it is still considered less biased than other high-throughput technologies. Besides, it can provide hints about functional relationships and interactions among genes, and its sensitivity allows the detection of very low levels of mRNA expression. Examples of microarray-based biomarkers that have been approved for clinical tests include MammaPrint, Roche AmpliChip, Rotterdam Signature, ColoPrint, and NuvoSelect. Despite these promising applications, microarray-based biomarkers are not widely adopted by either organisations issuing clinical guidelines or expert panels. Remaining challenges that need to be addressed in microarray data analysis include the cost of development, false-positive errors, data quality, data qualification and interpretation. Moreover, the lack of organised and integrated resources has worsened the situation.

Figure 2. The generic flow for microarray data classification: microarray dataset → data transformation → gene selection → classification → performance evaluation

Data mining provides a wide range of methodologies and tools for microarray data analysis, to uncover novel cancer biomarkers and to understand the underlying genetic causes of cancer. Its capability to cope with high-dimensional data makes it preferable to conventional statistical methods. Typical microarray data analysis with data mining techniques includes gene selection, classification and association rule mining. The main goal of microarray data classification is to build an efficient and effective classifier that is capable of differentiating gene expression profiles for accurate disease diagnosis or prediction. However, due to the large number of genes and small sample sizes, traditional statistical and classification techniques cannot deal with such data efficiently, leading to false positives and overfitting, as well as reduced classifier accuracy and speed. Therefore, gene selection, also known as feature selection, is an essential task in microarray data classification: it identifies differentially expressed genes and reduces dimensionality by removing irrelevant, redundant and noisy data. The k-nearest neighbours, naïve Bayes, logistic regression and support vector machine algorithms are popular choices for microarray data classification, and these classifiers have shown good performance when combined with suitable gene selection. Figure 2 presents the generic flow for microarray data classification.
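To make this flow concrete, here is a minimal sketch using scikit-learn (an assumed environment; the article does not prescribe any particular toolkit), with synthetic data standing in for a microarray dataset:

```
# A minimal sketch of the generic flow in Figure 2, assuming scikit-learn
# and a synthetic stand-in for microarray data (all parameters illustrative).
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic "microarray" data: few samples, many features (genes).
X, y = make_classification(n_samples=100, n_features=2000,
                           n_informative=20, random_state=0)

# Data transformation -> gene (feature) selection -> classification.
pipeline = make_pipeline(
    StandardScaler(),                 # data transformation
    SelectKBest(f_classif, k=50),     # gene selection
    SVC(kernel="linear"),             # classifier
)

# Performance evaluation with cross-validation.
scores = cross_val_score(pipeline, X, y, cv=5)
print(f"Mean accuracy: {scores.mean():.3f}")
```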

Along with that, Associative Classification (AC) has gained popularity in microarray studies for classifying data and discovering interesting biological relationships in large microarray datasets. Such information is important for extracting relevant gene markers, which in turn can lead to more reliable and accurate cancer diagnosis. AC is a hybrid approach that integrates classification with Association Rule Mining (ARM). In data mining, ARM is also known as association analysis or frequent pattern mining. ARM is widely used in business to predict customer purchasing behaviour, in what is called market basket analysis; it works by analysing customer transactions to identify frequently co-occurring items. The same idea can be applied to the human genome, which contains about 20,000 to 30,000 genes as interacting items. The classification-based-on-associations approach aims to identify a subset of rules, known as Class Associative Rules (CARs), whose consequents are restricted to target class labels. Classifiers built by existing AC methods have been shown to produce better accuracy and to improve the understandability and reasoning of a problem. On the assumption that good classification results from good biomarker identification, and vice versa, our study attempts to achieve this goal using the AC approach. The common strategy in AC is to decompose the problem into two major tasks, as shown in Figure 3. The first task is to generate a complete set of CARs that satisfy user-defined thresholds, such as minimum support and minimum confidence. The second task is to build a classifier based on the strongest CARs.
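As a rough, self-contained illustration of these two tasks (a toy example, not the authors' implementation), the sketch below enumerates candidate rules over a small transaction table and keeps those meeting the user-defined thresholds:

```
# A toy sketch of the two AC tasks; data and thresholds are illustrative.
from itertools import combinations

# Each transaction: (set of expressed genes, class label).
data = [({"GeneA", "GeneC"}, "tumour"), ({"GeneA", "GeneB"}, "tumour"),
        ({"GeneB"}, "normal"), ({"GeneA", "GeneB", "GeneC"}, "tumour")]
genes = sorted(set().union(*(items for items, _ in data)))
classes = {label for _, label in data}
MIN_SUPP, MIN_CONF = 0.5, 0.8

# Task 1: generate the CARs that satisfy user-defined minimum support
# and minimum confidence (brute-force enumeration for clarity).
cars = []
for k in (1, 2):
    for antecedent in map(set, combinations(genes, k)):
        n_x = sum(antecedent <= items for items, _ in data)
        for label in classes:
            n_xy = sum(antecedent <= items and c == label for items, c in data)
            supp, conf = n_xy / len(data), (n_xy / n_x if n_x else 0.0)
            if supp >= MIN_SUPP and conf >= MIN_CONF:
                cars.append((antecedent, label, supp, conf))

# Task 2: a classifier is then built from the strongest CARs,
# here ranked by confidence and then support.
cars.sort(key=lambda r: (-r[3], -r[2]))
for antecedent, label, supp, conf in cars:
    print(f"{sorted(antecedent)} -> {label} (supp={supp:.2f}, conf={conf:.2f})")
```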

Figure 3. The block diagram of an associative classification framework: discretisation → generate CARs (rule generation and pruning) → build classifier using CARs (rule selection, training and testing)

The problem of mining microarray data with AC can be described more formally. Let T = {t1, t2, …, tm} be a dataset or transaction database that contains all transactions. Let G = {g1, g2, …, gn} be the set of all items found in T, and let C = {c1, c2, …, ck} be the set of class labels. Each transaction ti consists of a set of items X, which has been labelled with a class y, such that X ⊆ G and y ∈ C. In other words, a transaction represents the set of expressed genes for a sample or patient, and a transaction database includes all gene expressions recorded from a microarray experiment. Below are the definitions for some of the terms in AC:

• Transaction database: Figure 4 (a) illustrates an example of a microarray data matrix with relative gene expression values. The genes are treated as items, while the array samples are treated as transactions. As microarray data consists of continuous attributes, it needs to be discretised and treated as categorical attributes. Figure 4 (b) shows an example of microarray data that has been discretised into a binary matrix. In discretisation, a certain cutoff value is applied: an expression value above the cutoff is considered highly expressed and is assigned a value of 1; otherwise, it is considered highly repressed and is assigned a value of 0. The discretised microarray data matrix can be further transformed into a transaction table, as shown in Figure 4 (c).

Figure 4. (a) Microarray data matrix with gene expression levels, (b) Microarray data matrix discretised into binary notation, and (c) Transaction table of a microarray dataset

• Item: An item is an attribute-value pair of the form (gi, v), where gi ∈ G is an attribute taking a value v, such as an expression value. In certain cases, an attribute can also take multiple values.

• Itemset: An itemset X, where X ⊆ G, is a set of zero or more items. A k-itemset is an itemset that contains k items; for example, {Gene A, Gene C, Gene D} is a 3-itemset.

• Class Associative Rule: A class associative rule is an implication expression of the form X → y, where X ⊆ G, y ∈ C, and G ∩ C = ∅. The left-hand side itemset is known as the antecedent, while the right-hand side is known as the consequent. The consequent of a CAR must be a single item from the set of class labels C. X and y are non-overlapping.

• Support: The support of a CAR X → y is defined as the percentage of transactions in the database that contain X and are associated with class y, that is, supp(X → y) = count(X ∪ y) / |T|. Support is an important indicator of the frequency of occurrence of a rule. Non-frequent rules that do not satisfy the user-defined minimum support can be pruned out, as their occurrence could be due simply to chance.

• Confidence: The confidence of a CAR X → y is the percentage of transactions containing X that also contain class y, that is, conf(X → y) = count(X ∪ y) / count(X). Confidence indicates the predictability and reliability of a rule; rules that do not satisfy the user-defined minimum confidence can be pruned out.
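A small sketch tying these definitions together, with illustrative values: it performs the Figure 4 transformations and then evaluates the support and confidence of one CAR using the formulas just defined:

```
# Illustrative sketch of the Figure 4 transformations and the support
# and confidence formulas (toy values, not real microarray data).
import pandas as pd

# (a) Microarray data matrix: samples x genes, relative expression values.
expr = pd.DataFrame({"GeneA": [2.1, 0.3, 1.8, 2.5],
                     "GeneB": [0.2, 1.9, 2.2, 0.1]})
labels = ["tumour", "normal", "tumour", "tumour"]

# (b) Discretisation with a cutoff: above -> 1 (highly expressed),
# otherwise -> 0 (highly repressed).
cutoff = 1.0
binary = (expr > cutoff).astype(int)

# (c) Transaction table: each sample becomes its set of expressed genes.
transactions = [set(expr.columns[row == 1]) for _, row in binary.iterrows()]

# supp(X -> y) = count(X ∪ y) / |T|; conf(X -> y) = count(X ∪ y) / count(X)
X, y = {"GeneA"}, "tumour"
n_x = sum(X <= t for t in transactions)
n_xy = sum(X <= t and c == y for t, c in zip(transactions, labels))
print(f"supp = {n_xy / len(transactions):.2f}, conf = {n_xy / n_x:.2f}")
```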

Figure 5. An integrative analysis framework with associative classification: heterogeneous data sources, namely genomic (e.g. gene ontology), transcriptomic (e.g. microarray), proteomic (e.g. protein-protein interaction) and metabolomic (e.g. KEGG pathway) data, flow through data integration (integrated data from different sources), biomarker selection (ranked rules with candidate biomarkers), optimisation (classification to select strong biomarkers) and evaluation (predictability, reproducibility and interpretability)

In the past few years, efforts have also been directed toward knowledge-driven, or integrative, analysis, which combines microarray data with other data sources, such as prior biological knowledge. In data mining, integrative analysis means combining heterogeneous data, information and knowledge to generate higher-level knowledge and new testable hypotheses. An integrated system of multiple types of high-throughput functional genomic data is expected to facilitate fast, accurate and systematic identification and prediction in highly complex biological data. To achieve that, sophisticated computational and analytical methods are needed to overcome current challenges by increasing the sensitivity and specificity achievable with high-throughput data. Diverse genomic data, such as gene ontologies, protein-protein interactions and KEGG pathways, are integrated by computational methods to create new integrated data capturing functional relationships between genes. These integrated data can then be used for microarray data classification, with evaluation done using cross-validation or a test set of labelled data.
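As one simple illustration of such integration (a hypothetical scheme, not the method used in the study), prior-knowledge terms attached to each gene can be added as extra items in the transactions, so that mined rules may mix genes and annotations:

```
# Hypothetical integration sketch: enrich each transaction with the
# annotation terms (e.g. GO terms, KEGG pathways) of its expressed genes.
annotations = {                       # assumed gene -> prior-knowledge terms
    "GeneA": {"GO:0006915", "hsa04210"},
    "GeneB": {"GO:0008283"},
}

transactions = [{"GeneA", "GeneB"}, {"GeneA"}]   # expressed-gene sets

integrated = [
    t | set().union(*(annotations.get(g, set()) for g in t))
    for t in transactions
]
print(integrated)   # transactions now carry genes plus functional terms
```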

Figure 5 presents the proposed integrative analysis framework for the integration of heterogeneous biological data (e.g. transcriptomic, genomic, proteomic, and metabolomic data), with processing components for data integration, biomarker selection, optimisation, and evaluation. An associative classification algorithm is adopted to generate the desired number of association rules with the highest support that meet the minimum confidence. Several modifications are introduced for mining informative association rules from both microarray and biological transaction tables. With the framework, the top-k CARs generated from the different target classes of the microarray transaction tables are integrated, and a new ranking algorithm is applied to select a set of strong rules containing potential biomarker genes that can discriminate between classes. An interestingness measure is proposed to rank the CARs, where the interestingness of a rule is the sum of the information content of three observations, namely the average information gain, the average classification accuracy and a modified enrichment score. The top-ranked class associative rule, the one with the lowest interestingness score, is considered the most informative rule, and its genes are considered the most informative genes. Gene subsets are generated from the informative genes of the top-ranked rules, and only the most informative gene subset is input for the training of classifiers. The best gene subset is the set of genes that achieves the highest predictive accuracy with the fewest genes. The evaluation of the selected biomarkers is based on their predictability, reproducibility and interpretability. Finally, the results obtained are evaluated and compared with other existing methods to determine whether the research problem has been resolved.
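The sketch below illustrates only the ranking step; the three scoring components are placeholder values, since their exact formulas are given in the paper rather than here:

```
# Hedged sketch of ranking CARs by interestingness (placeholder scores).
rules = [
    # (rule antecedent, avg. info gain, avg. accuracy, enrichment score)
    ({"GeneA", "GeneC"}, 0.12, 0.21, 0.05),
    ({"GeneB"},          0.30, 0.35, 0.22),
]

# Interestingness = sum of the three components; rank ascending, since the
# article states the top-ranked (most informative) rule has the lowest score.
ranked = sorted(rules, key=lambda r: r[1] + r[2] + r[3])
top_rule = ranked[0]
informative_genes = top_rule[0]   # genes feeding the gene-subset search
print(top_rule, informative_genes)
```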

The proposed framework has been tested on four UCI datasets and eight microarray datasets of colorectal and breast cancers. In comparison with other existing methods, it performed better in terms of classification accuracy and Area Under the Curve (AUC), and it showed significant reproducibility and interpretability results. These promising results demonstrate that the proposed method is capable of identifying potential genes, which can be further investigated as biomarkers for specific cancer diseases. The experimental results can be found in the paper 'Informative top-k class associative rule for cancer biomarker discovery on microarray data'. For future work, multi-platform microarray data can be integrated into the same integrative analysis to produce more reliable and accurate biomarker discovery, and an improved AC method can be introduced to increase the efficiency of rule mining.

AUTHOR BIO

Ong Huey Fang specialises in intelligent computing. Her interests are in AI, bioinformatics, and software engineering. Her recent research centres on using intelligent techniques to identify cancer biomarkers through omics data. She has also worked on intelligent systems related to ML, speech recognition, gesture recognition, and AR.


TOC for Cleaning Validation: Getting Started

Rohit Chakravorty, Lead Application Specialist, SUEZ - Water Technologies & Solutions
Michelle Neumeyer, Life Sciences Product Applications Specialist, SUEZ - Water Technologies & Solutions

1. What are some of the advantages of the various analytical methods for cleaning validation?

Analytical methods for cleaning validation can be broadly classified into two categories: specific and non-specific methods.

Specific methods are analytical methods that provide information about (or quantitate) only a specific ingredient in the formulation, commonly the active pharmaceutical ingredient (API).
• The key advantage is that these methods provide quantitative information about the target analyte.
• Examples of specific methods include chromatographic methods (HPLC, UPLC, GC), spectroscopic methods (MS, UV-visible, atomic absorption), electrophoresis methods, ELISA, etc.
• However, with ever-changing regulatory requirements, the specific methods listed above do not achieve the most comprehensive and compliant monitoring program. Global regulations now expect cleaning validation to encompass data on all potential contaminants, not just the API. Active ingredients, excipients, degradants, cleaning agents and detergents should be included as part of a monitoring program.
• Specific methods should be used where the risk to the product is moderate to low and when it is not feasible to perform visual inspection and non-specific testing.

Non-specific methods are analytical methods that provide information about (or quantitate) the entire formulation.
• The key advantage is that these methods give a quantitative value for the entire formulation, degradants, and cleaning agents combined, for a more comprehensive understanding of cleanliness.
• Examples of non-specific methods include total organic carbon (TOC), conductivity, gravimetric methods, pH, etc. TOC and conductivity are the most common non-specific methods deployed for cleaning validation, verification, and monitoring.
• Non-specific methods like TOC have more sensitive limits of detection (LOD) compared to specific methods like HPLC. TOC provides greater sensitivity while also capturing a complete picture of the cleaning process. TOC analysis is suitable for high- to low-risk products.

2. What are the risks of only measuring API in a cleaning validation program?

API is a very small part of the entire formulation and is not necessarily the most toxic or hardest-to-clean component. Thus, measuring something that may represent only one-tenth or one-fifteenth of the total material present inside the production equipment is a big risk. Excipients and degradants can be left undetected when using a specific method.

If the active ingredient is degraded during the cleaning process, it is quite difficult to quantify the API using specific methods. For example, degradants may have different properties than the intact API and may not elute with the same profile, thus going undetected. With non-specific methods such as TOC, degradants would be detected and quantified.

3. For manufacturers currently using a product-specific method for cleaning validation, such as HPLC, what are the steps needed to transition to TOC?

First, a feasibility study is required for the compounds of interest to determine appropriate recovery using TOC analysis. This is to ensure that TOC is suitable for performing cleaning validation with these compounds of interest. Next, TOC should be qualified as a method per USP <1225> or ICH Q2(R1) quantitative method validation guidelines. Finally, product limits should be converted to TOC limits, and the recovery factor should be considered. SUEZ offers numerous resources to ensure successful implementation of these steps.

4. How difficult is method development for TOC for cleaning validation, particularly related to calculating Maximum Allowable Carryover (MAC) and determining worst-case residues?

Method development is straightforward with Sievers TOC Analyzers. Acid and oxidiser flow rates need to be determined depending on the sample concentration. If concentrations are unknown, there is no need to worry – the Sievers Autoreagent feature is designed to use the correct acid and oxidiser flow rates based on sample matrix. This adds a few additional minutes to the process but is still faster than conventional specific methods.

Nothing changes for determining worst-case compounds and performing MAC calculations. The only additional step is multiplying the final limit by the percentage of carbon in the worst-case compound to obtain a TOC acceptance criterion.
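As a worked illustration of that step (all values hypothetical, and the exact treatment of the recovery factor should follow the site's validated protocol), the conversion might look like:

```
# Hypothetical conversion of a residue limit to a TOC acceptance criterion.
mac_ug_per_ml = 10.0     # assumed final residue limit from the MAC calculation
carbon_fraction = 0.40   # worst-case compound assumed to be 40 % carbon by mass
recovery_factor = 0.80   # recovery established during the feasibility study

# Express the limit as carbon, then adjust for recovery (one common
# convention; confirm against the validated procedure).
toc_limit = mac_ug_per_ml * carbon_fraction * recovery_factor
print(f"TOC acceptance criterion: {toc_limit:.2f} µg C/mL")
```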

SUEZ Applications Specialists are experienced in guiding customers in successful transitions from HPLC to TOC for cleaning validation. Support may include help with TOC limit calculations and optimising protocols for soluble and insoluble API.

5. Can TOC be used to detect compounds that are insoluble in water?

Using TOC requires an aqueous sample solution. That being said, many compounds that are considered hard to solubilise or even insoluble can be detected by TOC with suitable method development. Introducing heat, agitation, or adjusting the pH of a sample can greatly increase the solubility of a compound so that it can be readily detected using TOC analysis. As a part of method development, it is essential to demonstrate proper recovery of these compounds by running protocols to demonstrate linear recovery.

6. What types of detergents or cleaning agents can be detected using TOC?

TOC is commonly used to detect trace residues from many detergents and cleaning agents including those with acidic, alkaline, or oxidative matrices. TOC and conductivity may also be used together to analyse cleaning agents that exhibit both a conductivity and TOC response for better understanding of both the ionic and organic cleanliness. In a study conducted in the SUEZ Applications Laboratory, CIP 100, CIP 200, alkaline and neutral detergents, sterilant sporicidal agents, and quaternary ammonium cleaners were analysed using Sievers TOC Analysers. These all exhibited linear recoveries using several concentrations, establishing the suitability of TOC for detecting residual cleaning agents and detergents.

7. When using TOC for cleaning validation, how fast can manufacturers obtain data and make decisions for equipment release?

It depends on how the analysis is being performed. Traditional grab sampling and laboratory analyses can be time consuming and take days to release equipment. At-line analysis, wherein a portable analyser is in close proximity to the process, can reduce workflow and laboratory delays, allowing for equipment release within minutes. Online analysis with Sievers TOC, wherein an analyser is directly integrated with the process, allows for real-time data and real-time equipment release. The analysis itself takes a couple of minutes for each sample in standard mode. For processes with time constraints, or when profiling a cleaning process, Turbo mode can be used to gather data faster: it gives a data point every four seconds, allowing for huge efficiency gains in a cleaning validation program.

8. What are some of the greatest efficiency gains manufacturers can achieve with TOC for cleaning validation?

TOC enables manufacturers to deploy Process Analytical Technology (PAT) for a cleaning validation program. Using at-line or online technology for cleaning validation greatly increases the efficiency of a monitoring program. Not only are data and equipment released in real time, but the time spent on sampling, analysis, and human-error investigations is greatly reduced using PAT applications. Furthermore, TOC can aid in optimisation of the cleaning process itself: using TOC data, the amount of water, detergents, and time may be reduced based on the process-profiling capabilities of an online cleaning validation deployment. Finally, as mentioned in the previous question, obtaining data every four seconds with Turbo mode can significantly aid in process optimisation.

Rohit Chakravorty is the Lead Application Specialist with SUEZ - Water Technologies & Solutions, responsible for providing application support in India and the SAARC and ASEAN regions for the Sievers product line and TOC applications, notably in cleaning validation and real-time release testing (RTRT) of pharmaceutical water. Prior to joining SUEZ, Rohit was an Application Specialist with Sysmex and a Research Assistant at Sun Pharmaceutical in India. Rohit holds a Master's in Biotechnology (Gold Medalist) from Padmashree Dr. D.Y.Patil University in India.

Michelle Neumeyer is the Life Sciences Product Applications Specialist for the Sievers line of analytical instruments at SUEZ - Water Technologies & Solutions. Previously, Michelle worked in Quality at Novartis and AstraZeneca, ensuring compliant water systems, test methods and instrumentation. Michelle has a B.A. in Molecular, Cellular and Developmental Biology from the University of Colorado, Boulder.

