I need to classify proteins based on their amino acid sequence. An amino acid sequence is a string of letters of varying length. There are five classes: a, b, a/b, a+b, and other.
I know the class of some proteins. You can find them here. The first entry in each line is the protein ID and the class is given after “cl=”. The classes are encoded as the numbers 1000000 to 1000004.
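To make the label format concrete, here is a minimal Python sketch of how I assume these lines could be read. The function name `parse_class_lines`, the skipping of lines that start with "#", and the case-insensitive match on "cl=" are my own assumptions, not something the database documentation prescribes.

```python
import re

# Assumption: the class code is the run of digits directly after "cl="
# (matched case-insensitively, in case the file writes "CL=").
CLASS_PATTERN = re.compile(r"cl=(\d+)", re.IGNORECASE)


def parse_class_lines(lines):
    """Return a dict mapping protein ID -> integer class code."""
    labels = {}
    for line in lines:
        line = line.strip()
        # Assumption: empty lines and lines starting with "#" carry no label.
        if not line or line.startswith("#"):
            continue
        match = CLASS_PATTERN.search(line)
        if match is None:
            continue  # no "cl=" field on this line
        protein_id = line.split()[0]  # first entry in the line
        labels[protein_id] = int(match.group(1))
    return labels
```

With a real file, the lines could be passed in via `open(path, encoding="utf-8")`, where `path` is wherever the download was saved.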
The sequences of the proteins (both of known class and unknown) can be found here. You can match them with the protein ID in the lines starting with “>”. The actual sequence is in the line below.
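As an illustration of the matching step, here is a minimal Python sketch that reads the sequence lines and separates proteins of known class from the rest. The names `parse_sequences` and `split_by_label` are hypothetical, and I am assuming the protein ID is the first whitespace-separated token after ">". The `labels` argument is assumed to be a dict from protein ID to class code.

```python
def parse_sequences(lines):
    """Return a dict mapping protein ID -> amino acid sequence."""
    sequences = {}
    current_id = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Assumption: the ID is the first token after ">".
            fields = line[1:].split()
            current_id = fields[0] if fields else None
            if current_id is not None:
                sequences[current_id] = ""
        elif current_id is not None:
            # Append, so a sequence wrapped over several lines also works.
            sequences[current_id] += line
    return sequences


def split_by_label(sequences, labels):
    """Split sequences into (known, unknown) using a dict of ID -> class."""
    known, unknown = [], []
    for protein_id, sequence in sequences.items():
        if protein_id in labels:
            known.append((protein_id, sequence, labels[protein_id]))
        else:
            unknown.append((protein_id, sequence))
    return known, unknown
```

The `known` list would then serve as training data, and the `unknown` list holds the proteins whose class still has to be predicted.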
I have a Windows PC and only a little experience with such tasks.
Important: This is an example question to give an idea what kind of questions can be asked about microscopy images and how the follow-up interactions could look. The data at hand is actually fully labelled. I am not the owner of the data. The links point to the SCOP database - release 2020-02-28 available at http://scop.mrc-lmb.cam.ac.uk/
Andreeva, Antonina, et al. “The SCOP database in 2020: expanded classification of representative family and superfamily domains of known protein structures.” Nucleic acids research 48.D1 (2020): D376-D382.
Andreeva, Antonina, et al. “SCOP2 prototype: a new approach to protein structure mining.” Nucleic acids research 42.D1 (2014): D310-D314.
The following Kaggle challenge is similar in spirit to the example problem: https://www.kaggle.com/googleai/pfam-seed-random-split.