Genotypes / Phenotypes Datasets




NGYX I.C. Genotype/Phenotype(s) Datasets/Subsets.


Our Genotype / Phenotype (GP) Datasets, covering the 22 FDA approved drugs that target the HIV1 PR & RT, have each been divided into 25 Subsets (sizes from more than 1,000 up to about 2,500). The Subtypes ("Clades") and Drug Resistance values (Log Fold Changes; see below for more information) are distributed almost equally across these 25 subsets. For each of the 22 drugs, a dedicated web page illustrates the main characteristics of the full Dataset and shows that the splitting method populates the subsets uniformly.


For each of the 22 drugs you can download Subset #1 for FREE; if you want to analyse an extended dataset, you can order additional subsets. To help with that choice we also generated "INFO" files that summarise the main characteristics and contents (Clades, Resistances, Mutations) of every Dataset and Subset. These files are raw data, but their very detailed information can help you select subsets.


Good to know: NGYX I.C. will offer a full HIV1 Drug Resistance Diagnostic Service in the next few months. While R&D for this service is ongoing, we have already found that, although a subset of e.g. 2,000 Geno/Phenos captures numerous "classified Survey mutations", the performance of any prediction system starts to be very good once the dataset approaches 10,000. Beyond this "break point", adding more GPs still improves performance, but slowly, and is mainly of interest to experts who want to identify emerging (rare) mutations.


IMPORTANT: Because of Data Privacy legislation, NGYX I.C. cannot provide more details, especially about the source Genotypes & Phenotypes. Although these data were obtained from various sources, ALL come from FDA approved datasets.


The picture below shows the structure and content of the DATA files (the example is ABC Subset #1).

Legend of picture.

Column A is the NGYX I.C. Sample ID. Important: a given Sample has only one Genotype, but it can have several Phenotypes when the experiment has been repeated. Global numbers are available in the INFO files for all Subsets, but you can also identify repeats using Column C in the DATA file.

Column B is the Sampling Date, i.e. the date on which the blood sample was collected (YYYYMMDD).
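As a small illustration of columns A and B, the Python sketch below counts how many phenotype rows each Sample ID has and parses a YYYYMMDD sampling date. The sample IDs, the dates and the in-memory row layout are hypothetical; the real DATA files may need a different reader.

```python
from collections import Counter
from datetime import date, datetime

# Hypothetical rows: (sample_id, sampling_date) as read from columns A and B.
rows = [
    ("S0001", "20050314"),
    ("S0001", "20050314"),  # same sample, phenotype measured a second time
    ("S0002", "20071102"),
]


def parse_sampling_date(text: str) -> date:
    """Convert a YYYYMMDD string into a date object."""
    return datetime.strptime(text, "%Y%m%d").date()


# Number of phenotype rows per sample; a count above 1 marks a repeated experiment.
phenotypes_per_sample = Counter(sample_id for sample_id, _ in rows)
repeated = sorted(s for s, n in phenotypes_per_sample.items() if n > 1)

print(repeated)
print(parse_sampling_date("20050314"))
```

Counting rows per Sample ID is only one way to spot repeats; Column C of the DATA file serves the same purpose.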

The next three columns (D, E and F) are phenotypic values:

               - Column D is a Log  (base 10) Fold Change with FC = IC50 Mutant / IC50 Wild Type.

               - Column E is a Normalized LogFC = (LogFC - Mean of LogFC) / Standard Deviation of LogFC.

               - Column F is also a Normalized LogFC = (LogFC - Minimal LogFC)/(Maximal LogFC - Minimal LogFC). So it is a percentage 0% to 100% of resistance (for experts: minimum is 0.000001 and maximum 0.999999; you know why...).
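The three phenotypic values can be sketched in Python as follows. This is only an illustration of the formulas above, not NGYX I.C.'s actual code: the use of the population standard deviation and the clipping of column F to the 0.000001 to 0.999999 range are assumptions, and the function names are made up for the example.

```python
import math
import statistics


def log_fold_change(ic50_mutant: float, ic50_wild_type: float) -> float:
    """Column D: log10 of FC, with FC = IC50 Mutant / IC50 Wild Type."""
    return math.log10(ic50_mutant / ic50_wild_type)


def z_scores(log_fcs):
    """Column E: (LogFC - mean) / standard deviation.

    Assumption: population standard deviation; the page does not say
    whether the sample or the population formula is used.
    """
    mean = statistics.mean(log_fcs)
    sd = statistics.pstdev(log_fcs)
    return [(x - mean) / sd for x in log_fcs]


def min_max(log_fcs, low=0.000001, high=0.999999):
    """Column F: (LogFC - min) / (max - min), kept inside [low, high].

    Assumption: the bounds are applied by clipping the scaled value.
    """
    smallest, largest = min(log_fcs), max(log_fcs)
    scaled = [(x - smallest) / (largest - smallest) for x in log_fcs]
    return [min(max(v, low), high) for v in scaled]


example = [0.0, 1.0, 2.0]  # hypothetical LogFC values
print(log_fold_change(100.0, 10.0))
print(z_scores(example))
print(min_max(example))
```

Both normalisations depend on the set of values they are computed over, so values recomputed on a single subset need not match those computed on the full Dataset.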

Column G is the Genotype, i.e. a comma-delimited list of Mutations. It is fully explained in a picture below.

Column H is the Subtype (Clade). We used a pool of reference sequences recommended by REGA, NCBI and LANL (192 nucleotide sequences). The reported subtype is that of the closest reference sequence in terms of nucleotide identity over the full PR-RT region.

Columns I and J are somewhat special. They are linked to the NGYX I.C. methodology for generating subsets. They are explained elsewhere (click HERE), but keep in mind that:

               - Column I is for subsets #1 to #25. Each Subset uses only 4 values, and these values are specific to the subset. Example: in Subset #1 you will only find the values 1, 26, 51 & 76.

               - Column J is dedicated to FULL dataset analyses, with only four values: 1, 2, 3 & 4.
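A quick sanity check on columns I and J could look like the Python sketch below. It relies only on the values quoted above for Subset #1 and for the FULL datasets; the helper name and the sample column contents are hypothetical, and the values used by the other subsets are not assumed here.

```python
# Values the page gives for column I in Subset #1.
SUBSET_1_LABELS = {1, 26, 51, 76}
# Values the page gives for column J in the FULL datasets.
FULL_DATASET_LABELS = {1, 2, 3, 4}


def unexpected_labels(column_values, allowed):
    """Return the sorted labels found in a column that are not in `allowed`."""
    return sorted(set(column_values) - set(allowed))


column_i = [1, 26, 51, 76, 26, 1]  # hypothetical column I of a Subset #1 file
column_j = [1, 2, 3, 4, 5]         # hypothetical column J with a stray value

print(unexpected_labels(column_i, SUBSET_1_LABELS))
print(unexpected_labels(column_j, FULL_DATASET_LABELS))
```

An empty result means the column only contains the expected values; anything else points to a file from a different subset or a parsing problem.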


The picture below helps to explain the Phenotypic Values.


The picture below helps to explain the Genotype structure.

Legend of Picture above.

The picture above is largely self-explanatory. Three additional remarks:

The "X" symbol is used when the "explosion" of a degenerate codon leads to more than 2 a.a. "X" is always associated with a factor of 2, since there are of course at least 2 different viral subpopulations.

Wild Type a.a. are never reported (by definition they are not mutations), with one exception: when present in a "Mixture" of Mutation a.a. + Wild Type a.a.

Insertion and Deletion a.a. symbols are always associated with a factor of 1, as NGYX I.C. considers these phenomena first-rank characteristics that are not directly related to the type of a.a. inserted/deleted.


So it is time to get your subsets (at least the FREE ones): Click HERE!

NGYX I.C. details / Coordinates.

Company N°: BE 0537.471.159

Postal Address: NGYX I.C. (P. LECOCQ), rue des Hausseurs 10, B-4550 Nandrin, BELGIUM.

Email: Info@NGYX.EU

Tel. / GSM: +32 498 532496

IBAN: BE63 7506 5746 0708

BIC: AXABBE22
