Example Problem: Classify proteins by their sequence

Hi everyone,
I need to classify proteins based on their amino acid sequence. An amino acid sequence is a sequence of letters of varying length. There are five classes: a, b, a/b, a+b, other.

I know the class of some proteins. You can find them here. The first entry in each line is the protein ID and the class is given after “cl=”. They are encoded as 1000000 - 1000004.

The sequences of the proteins (both of known class and unknown) can be found here. You can match them with the protein ID in the lines starting with “>”. The actual sequence is in the line below.
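
For anyone who wants to get started, here is a minimal sketch of how the two files could be read and matched by protein ID. It assumes that the class file is whitespace-separated with the class written as "cl=" followed by digits, that lines starting with "#" are comments, and that each ">" line of the sequence file starts with the protein ID. The function names and the example lines are made up for illustration; they are not real entries from the files.

```python
import re

# Assumption: the class appears as "cl=<digits>" somewhere on the line.
CLASS_PATTERN = re.compile(r"\bcl=(\d+)", re.IGNORECASE)


def parse_classes(lines):
    """Map protein ID (first entry of each line) to its class code."""
    classes = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and (assumed) comment lines
        match = CLASS_PATTERN.search(line)
        if match:
            classes[line.split()[0]] = int(match.group(1))
    return classes


def parse_sequences(lines):
    """Map protein ID (first entry after '>') to the sequence below it."""
    sequences = {}
    current_id = None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            header = line[1:].split()
            current_id = header[0] if header else None
        elif line and current_id is not None:
            # Concatenate in case a sequence is wrapped over several lines.
            sequences[current_id] = sequences.get(current_id, "") + line
    return sequences


def split_labelled(classes, sequences):
    """Separate sequences with a known class from those without one."""
    labelled = {pid: (seq, classes[pid])
                for pid, seq in sequences.items() if pid in classes}
    unlabelled = {pid: seq
                  for pid, seq in sequences.items() if pid not in classes}
    return labelled, unlabelled


# Made-up example lines, not real entries from the files.
class_lines = [
    "# made-up example lines",
    "1111111 X1 X2 tp=1,cl=1000002,cf=2000001",
    "2222222 Y1 Y2 tp=1,cl=1000000,cf=2000002",
]
sequence_lines = [
    ">1111111 some description",
    "MKVLAA",
    ">3333333 some description",
    "GSHMTT",
]

classes = parse_classes(class_lines)
sequences = parse_sequences(sequence_lines)
labelled, unlabelled = split_labelled(classes, sequences)
```

With the real files, the two lists would be replaced by the opened files, for example `parse_classes(open(path))`, where `path` is wherever the downloaded file was saved.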

I have a Windows PC and have little experience with such tasks.

Important: This is an example question to give an idea of what kind of questions can be asked about microscopy images and what the follow-up interactions could look like. The data at hand is actually fully labelled. I am not the owner of the data. The links point to the SCOP database, release 2020-02-28, available at http://scop.mrc-lmb.cam.ac.uk/

Andreeva, Antonina, et al. “The SCOP database in 2020: expanded classification of representative family and superfamily domains of known protein structures.” Nucleic acids research 48.D1 (2020): D376-D382.

Andreeva, Antonina, et al. “SCOP2 prototype: a new approach to protein structure mining.” Nucleic acids research 42.D1 (2014): D310-D314.

The following Kaggle challenge is of similar spirit to the example problem: https://www.kaggle.com/googleai/pfam-seed-random-split.


Hello! Can you tell us a bit about how this relates to COVID-19?

Hi lesolorzanov,
[This is an example problem to illustrate the types of problems that could be posed here. I have no idea whether, and if so how, the question relates to COVID-19, but I will make something up to keep this example exchange going.]
We have a long list of candidate proteins that might be useful in combating COVID-19 and expect that the protein class will be a good predictor of whether they are actually useful. So knowing the classes helps us to focus our further steps on the more promising candidate proteins.

Hi @lesolorzanov, great question. @sdamrich, great answer! :wink:
We are in an early phase here and are simply trying to get some demo entries into the otherwise empty forum. We need that in order to be able to showcase the platform we are setting up to others… you know… seeing is so much better than listening to long and abstract explanations only.

Thank you both for being active here!

Hi @sdamrich!

Could you explain the structure of the first file you posted in more detail? Is the protein ID given by the first 9 columns of each line?
Moreover, from what I can see, the last column provides some extra information, but the only relevant part for this problem is the protein class. Is that correct?

Thank you!

Hi @alberto.bailoni,

thanks for having a closer look at the data! You are entirely right, I should have been more precise. In this file, each row contains several identifiers from different protein databases and also some identifiers for the protein (super)family. You can safely ignore them all and match protein classes and sequences between the two files by the identifiers in the first column of the respective files.
Regarding your second question: Yes, there is more information than just the class available, for instance the type and fold (tp, cf) of the protein. We only need to predict the protein class for our task, but if you can use the additional information to improve the classification, feel free to do so.
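
In case someone wants to try the additional information, here is a rough sketch of how the last column could be turned into a dictionary. It assumes that the last column is a comma-separated list of key=value pairs such as tp, cl and cf; the function name and the example line are made up for illustration.

```python
def parse_annotations(line):
    """Return the key=value pairs of the last column as a dictionary."""
    fields = line.split()
    if not fields:
        return {}
    annotations = {}
    # Assumption: the pairs are separated by commas, e.g. "tp=1,cl=1000002".
    for item in fields[-1].split(","):
        key, sep, value = item.partition("=")
        if sep:  # ignore anything that is not of the form key=value
            annotations[key.lower()] = value
    return annotations


# Made-up example line, not a real entry from the file.
example = "1111111 X1 X2 tp=1,cl=1000002,cf=2000001"
info = parse_annotations(example)
```

Here `info["cl"]` would be the target to predict, while `info["tp"]` and `info["cf"]` could serve as optional extra inputs for the proteins where they are available.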
Thanks a lot for your help everyone!