Consistent blind protein structure generation from NMR chemical shift data

Shen et al. 10.1073/pnas.0800256105.

Supporting Information

Files in this Data Supplement:

SI Figure 5
SI Table 3
SI Figure 6
SI Figure 7
SI Figure 8
SI Figure 9
SI Figure 10
SI Table 4
SI Figure 11
SI Table 5
SI Figure 12
SI Figure 13





Fig. 5. Plots of fragment accuracy for GB3. For each specific GB3 segment, 200 fragment candidates were selected using either the standard ROSETTA procedure (filled triangles) or an MFR search of the 5665-protein structural database, with chemical shifts assigned by the program DC (filled circles) or SPARTA (filled diamonds). Like SPARTA, DC can readily assign chemical shifts to a large database of protein structures, but its error in predicted chemical shift is on average slightly worse than for SHIFTX, and about 17% worse than for SPARTA. For all panels, coordinate rmsds (N, Cα, and C') between query segment and selected fragments are normalized with respect to randomly selected fragments (i.e., the average rmsd between this target fragment and 1200 randomly selected fragments of the same length). The averaged rmsd of the 200 selected fragments is plotted as a solid line; dotted lines represent the lowest rmsd (best fragment out of 200). (A and B) Average (A) and lowest (B) rmsd of 200 selected fragments, as a function of fragment size, relative to the NMR coordinates of the corresponding GB3 segment, averaged over all (overlapped) consecutive segments. (C and D) Average rmsd of 200 nine-residue (C) and three-residue (D) fragments relative to the x-ray coordinates, as a function of position in the GB3 sequence. (E and F) Lowest rmsd of any of these selected nine-residue (E) or three-residue (F) fragments.
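The normalization described in this caption can be illustrated with a minimal sketch. It is not the authors' code: the function names are hypothetical, fragments are represented as plain lists of (x, y, z) backbone-atom coordinates, and every fragment is assumed to have been optimally superimposed on the target beforehand, so that the sketch only covers the rmsd and the division by the random-fragment baseline.

```python
import math
from statistics import mean

def coordinate_rmsd(a, b):
    """Rmsd between two equally long lists of (x, y, z) points.

    Assumption: the two fragments are already superimposed.
    """
    if len(a) != len(b) or not a:
        raise ValueError("fragments must be non-empty and of equal length")
    squared = sum((p - q) ** 2
                  for atom_a, atom_b in zip(a, b)
                  for p, q in zip(atom_a, atom_b))
    return math.sqrt(squared / len(a))

def normalized_fragment_rmsds(target, selected, random_pool):
    """Return (average, lowest) normalized rmsd of the selected fragments.

    The baseline is the average rmsd between the target and the randomly
    selected fragments of the same length; each selected-fragment rmsd is
    divided by that baseline.
    """
    baseline = mean(coordinate_rmsd(target, frag) for frag in random_pool)
    values = [coordinate_rmsd(target, frag) / baseline for frag in selected]
    return mean(values), min(values)
```

With this convention a value of 1 means the selected fragments are no closer to the target than random fragments, and smaller values mean better fragment selection.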





Fig. 6. Comparison of results obtained with standard ROSETTA and CS-ROSETTA for ubiquitin and GB3. All-atom energy versus Cα rmsd of the ROSETTA models obtained using standard sequence-based ROSETTA-selected fragments (Upper) and chemical-shift-based MFR-selected fragments (Lower) for ubiquitin (Left) and GB3 (Right). All-atom energies correspond to the raw ROSETTA energy score, before rescoring using experimental chemical shifts.





Fig. 7. Plots of ROSETTA all-atom energy versus Cα rmsd relative to the experimental structures for proteins of Table 1, not presented in Fig. 2. For each of these proteins, the upper plots show the standard ROSETTA all-atom energy versus Cα rmsd from the experimental structures (see SI Table 3), and the lower plots show ROSETTA all-atom energy rescored by using the experimental chemical shifts (cf. Eq. 1). The model with the lowest energy, marked by an arrow, is shown either in Fig. 3 or SI Fig. 9.





Fig. 8. Plot of χ²cs score (Eq. 1b) of CS-ROSETTA models versus Cα rmsd relative to the experimental structures for proteins listed in Table 1.





Fig. 9. Backbone ribbon representations of the lowest-energy CS-ROSETTA model (red), superimposed on the experimental x-ray/NMR structures (blue) for the proteins listed in Table 1, with superposition optimized for ordered residues, as defined in the footnote to SI Table 3. Overlays of the 6 remaining structures are shown in Fig. 3.





Fig. 10. Plots of ROSETTA all-atom energy versus Cα rmsd of CS-ROSETTA models relative to the lowest-energy models for each of the 16 test proteins of Table 1.





Fig. 11. Plots of ROSETTA all-atom energy versus Cα rmsd of CS-ROSETTA models for the 7 proteins of SI Table 4, for which no convergence was obtained. For each protein, the upper panel presents the chemical-shift-rescored ROSETTA all-atom energy versus the Cα rmsd from the experimental structure; for the lower panels the Cα rmsd is calculated versus the coordinates of the lowest-energy model, whose energy is marked as a bold dot on the y axis. For the nsp1 protein, the lowest-energy model is the only one out of 12,000 generated models that has the same topology as the experimental NMR structure, and even then it deviates considerably (backbone rmsd of 5.1 Å) from the experimental NMR structure.





Fig. 12. CS-ROSETTA structures generated for five structural genomics targets (Table 2). The remaining four are shown in Fig. 4. (A-E) Superposition of lowest-energy CS-ROSETTA models (red) with experimental NMR structures (blue), with superposition optimized for ordered residues, as defined in the footnote to SI Table 5. (A'-E') Plots of rescored (Eq. 1) ROSETTA all-atom energy versus Cα rmsd, calculated relative to the lowest-energy model (bold dot on y axis). (A and A') TR80; (B and B') RhR95; (C and C') PsR211; (D and D') AtR23; (E and E') NeR45A.





Fig. 13. Accuracy of the models in subsets randomly selected from the final ROSETTA all-atom models. For each protein (Table 1 and SI Table 3), 100 subsets were randomly selected from the final ROSETTA all-atom models at each subset size (5, 50, 100, 1,000, 5,000, and 10,000 models). The Cα rmsd (relative to the experimentally determined reference structure) of the lowest-energy model in each subset was calculated, and the values averaged over the 100 subsets are plotted against subset size. The figure shows that for 13 of the 16 proteins, generation of 5,000 ROSETTA full-atom models suffices to yield a lowest-energy model that differs by ≤0.2 Å from the lowest-energy models obtained by using 10,000-20,000 ROSETTA predictions (Table 1).
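The subset analysis in this caption can be sketched as follows. This is an illustrative reconstruction, not the authors' script: the function name is hypothetical, each model is reduced to an (energy, rmsd) pair, and subsets are assumed to be drawn without replacement, which the caption does not state explicitly.

```python
import random
from statistics import mean

def mean_rmsd_of_lowest_energy(models, subset_size, n_subsets=100, rng=None):
    """Average rmsd of the lowest-energy model over random subsets.

    models: list of (energy, rmsd) tuples, one per generated model.
    subset_size: number of models drawn per subset (assumed: without
        replacement).
    n_subsets: number of random subsets; the caption uses 100.
    """
    if not 0 < subset_size <= len(models):
        raise ValueError("subset_size must be between 1 and len(models)")
    rng = rng or random.Random(0)  # fixed seed only for reproducibility
    winners = []
    for _ in range(n_subsets):
        subset = rng.sample(models, subset_size)
        lowest = min(subset, key=lambda model: model[0])
        winners.append(lowest[1])
    return mean(winners)
```

Calling the function for each subset size (5, 50, 100, and so on) and plotting the returned values against size would give one curve of the kind described above.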





Table 3. Full survey of converged protein structures generated by CS-ROSETTA

 

| Protein name | PDB*/BMRB ID | Nα/Nβ† | Nall‡ | Ncs§ | RMSDmean¶ backbone [Å] | RMSDmean¶ all [Å] | RMSDexp║ backbone [Å] | RMSDexp║ all [Å] |
|---|---|---|---|---|---|---|---|---|
| GB3 | 2OED | 14/26 | 56(1-55) | 332 | 0.25±0.08 | 0.48±0.11 | 0.74±0.05 (0.69) | 1.43±0.05 (1.34) |
| CspA | 1MJC/4296 | 0/33 | 70(4-70) | 405 | 0.96±0.23 | 1.44±0.19 | 1.43±0.29 (1.08) | 2.25±0.33 (1.74) |
| Calbindin | 4ICB/390 | 47/0 | 75(3-74) | 435 | 0.68±0.23 | 0.90±0.21 | 1.39±0.11 (1.20) | 2.13±0.07 (1.92) |
| Ubiquitin | 1D3Z | 18/25 | 76(2-72) | 426 | 0.34±0.11 | 0.76±0.12 | 0.82±0.06 (0.75) | 1.59±0.14 (1.40) |
| XcR50 | 1TTZ/6363 | 28/16 | 76(3-73) | 352 | 0.98±0.32 | 1.37±0.39 | 1.67±0.27 (1.34) | 2.13±0.50 (2.06) |
| DinI | 1GHH | 36/21 | 81(1-77) | 463 | 0.90±0.24 | 1.16±0.25 | 1.73±0.25 (1.54) | 2.38±0.14 (2.07) |
| HPr | 1POH | 29/23 | 85(2-83) | 419 | 0.95±0.32 | 1.28±0.35 | 1.30±0.43 (0.93) | 1.99±0.37 (1.54) |
| MrR16 | 1YWX/6799 | 23/35 | 88(2-81) | 514 | 0.73±0.18 | 1.03±0.19 | 1.77±0.22 (1.61) | 2.40±0.21 (2.17) |
| TM1112 | 1O5U/5357 | 10/52 | 89(4-88) | 524 | 1.06±0.26 | 1.55±0.22 | 1.58±0.16 (1.16) | 2.30±0.14 (1.70) |
| PHS018 | 2GLW/7116 | 20/41 | 92(6-88) | 531 | 1.12±0.31 | 1.51±0.28 | 1.56±0.26 (1.08) | 2.27±0.20 (1.69) |
| HR2106** | 2HZ5/6210 | 37/25 | 96(2-92) | 470 | 0.80±0.26 | 1.10±0.22 | 1.85±0.27 (1.47) | 2.58±0.23 (2.14) |
| TM1442 | 1SBO/5921 | 41/23 | 110(5-109) | 647 | 0.66±0.31 | 1.02±0.29 | 1.22±0.27 (1.01) | 1.90±0.20 (1.60) |
| Vc0424 | 1NXI/5589 | 55/25 | 114(2-112) | 679 | 0.88±0.16 | 1.34±0.17 | 1.74±0.09 (1.35) | 2.53±0.11 (2.04) |
| Spo0F | 1SRR/5899 | 55/25 | 121(2-115) | 590 | 1.09±0.21 | 1.41±0.22 | 1.67±0.19 (1.26) | 2.30±0.13 (1.80) |
| Profilin | 1PRQ | 41/41 | 125(2-123) | 595 | 1.04±0.31 | 1.46±0.35 | 2.26±0.35 (2.02) | 2.88±0.34 (2.49) |
| Apo_lfabp | 1LFO/4098 | 15/70 | 129(5-126) | 688 | 1.36±0.35 | 1.64±0.30 | 1.72±0.55 (1.12) | 2.33±0.43 (1.68) |

* Proteins for which experimental structures were obtained by X-ray diffraction are in italic; for proteins solved by NMR the first model of the NMR ensemble is used as the experimental reference structure.

† Nα/Nβ: number of residues in α-helix and β-strand.

‡ Nall: total number of residues. Numbers of the first and last residue involved in secondary structures are listed in parentheses; these and all intervening residues were used to superimpose structures and to calculate the RMSD values of the predicted models relative to experimental structures. For CspA, residues 39 to 46 in the flexible loop are excluded from the RMSD calculation.

§ Total number of backbone chemical shifts used for the structure prediction; no δ13C' available for XcR50, HR2106 and Spo0F; no δ1HN available for Profilin.

¶ RMSD between the 10 lowest-energy models and the mean coordinates for all backbone Cα, C' and N atoms (referred to as "Backbone"), and all non-hydrogen atoms ("All").

║ RMSD between the 10 lowest-energy models and the experimental structure. The RMSDs between the mean coordinates of the 10 lowest-energy models and the experimental structures are listed in parentheses.

** Protein HR2106 is a homo-dimer; only the monomer conformation is predicted by CS-ROSETTA and used for comparisons.





Table 4. Survey of proteins for which CS-ROSETTA did not meet convergence criteria

 

| Protein name | PDB*/BMRB code | Nα/Nβ† | Nall‡ | Nshifts§ | Cα rmsd¶, lowest RMSD [Å] | Cα rmsd¶, lowest energy [Å] |
|---|---|---|---|---|---|---|
| HI0719 | 1J7H/5606 | 40/30 | 130 (3-129) | 733 | 4.50 | 14.31 |
| MTH1598 | 1JW3/5165 | 32/47 | 140 (4-139) | 830 | 3.65** | 12.17** |
| HR1958 | 1TVG/6344 | 8/73 | 140 (4-139) | 829 | 9.37†† | 16.29†† |
| CcR19 | 1T17/6120 | 37/59 | 148 (2-144) | 842 | 3.67 | 7.09 |
| YwIE | 1ZGG/6460 | 68/21 | 150 (2-145) | 851 | 3.72 | 9.37 |
| Flua | 1N0S/5756 | 26/83 | 173 (2-163) | 1022 | 5.54 | 15.57 |
| nsp1 | 2GDT/7014 | 17/33 | 116 (2-112) | 609 | 5.16‡‡ | 5.16‡‡ |

* Proteins with reference X-ray structures are in italic; for proteins solved by NMR the first model of the NMR ensemble is used as the reference structure.

† Nα/Nβ: number of residues in α-helix and β-strand.

‡ Nall: total number of residues. Numbers of the first and last residue involved in secondary structures are listed in parentheses; these and all intervening residues were used to calculate the RMSD values of the predicted models relative to experimental structures.

§ Total number of backbone chemical shifts.

¶ Cα RMSD (relative to the experimental reference structures) for the models with the lowest RMSD and lowest energy.

║ Residues 7 to 20 and 31 to 45, which are in flexible loops, are excluded from the RMSD calculation.

** Residues 39 to 47 and 104 to 123, in flexible loops, are excluded from the RMSD calculation.

†† Flexible loop residues 17-38 are excluded for the RMSD calculation.

‡‡ Flexible loop residues 63-73 are excluded for the RMSD calculation.





Table 5. Survey of protein structures generated by CS-ROSETTA and independently by the NESG consortium

 

 

| Protein name | RpT7 | StR82 | RhR95 | NeT4 | TR80 | VfR117 | PsR211 | AtR23 | NeR45A‡‡ |
|---|---|---|---|---|---|---|---|---|---|
| UniProt ID | Q6N4D8_RHOPA | Q04822_SALTY | Q3IZ23_RHOS4 | Q82V59_NITEU | RLX_METTH | Q5E7H1_VIBF1 | Q885L4_PSESM | Q8UEE9_AGRT5 | Q82VF2_NITEU |
| PDB/BMRB ID | 2jtv | 2jt1 | 2jvm | 2jv8 | 2jxt | 2jvw | 2jva | 2yja | 2jxn |
| Protein size* | 65(2-63) | 69(5-69) | 72(22-68) | 73(3-66) | 78(5-77) | 80(15-75) | 100(2-100) | 101(2-78) | 147(16-143) |
| M.W.* [kDa] | 7.8 | 8.0 | 8.5 | 8.7 | 9.8 | 10.2 | 11.6 | 10.8 | 15.4 |
| Nα/Nβ | 38/15 | 36/10 | 4/19 | 11/18 | 23/31 | 43/0 | 29/21 | 11/25 | 41/52 |
| NCS | 345 | 400 | 405 | 429 | 357 | 468 | 589 | 569 | 765 |
| Predicted models | | | | | | | | | |
| RMSDbb/RMSDall§ [Å] | 0.73±0.10 / 1.25±0.18 | 0.24±0.09 / 0.53±0.13 | 0.68±0.26 / 1.26±0.26 | 0.47±0.15 / 1.05±0.15 | 0.44±0.11 / 0.84±0.11 | 0.68±0.16 / 1.15±0.22 | 1.34±0.27 / 1.72±0.24 | 1.19±0.67 / 1.73±0.65 | 0.83±0.17 / 1.29±0.14 |
| Ramachandran plot¶,§ [%] | 98/2/0/0 | 98/2/0/0 | 95/5/0/0 | 90/10/0/0 | 96/4/0/0 | 96/4/0/0 | 95/5/0/0 | 96/4/0/0 | 95/5/0/0 |
| Procheck G-factor§, Φ&Ψ/All | 0.20/0.38 | 0.47/0.56 | -0.26/0.11 | -0.13/0.21 | -0.1/0.16 | 0.50/0.56 | 0.11/0.27 | -0.12/0.20 | -0.01/0.21 |
| MOLPROBITY clash score§ | 6.71 | 7.28 | 4.40 | 1.98 | 3.62 | 4.50 | 6.38 | 4.41 | 3.34 |
| DP score§ [%] | 69 | 65 | 55 | 57 | 67 | 37 | 57 | 60 | 53 |
| NMR ensembles | | | | | | | | | |
| RMSDbb/RMSDall§ [Å] | 0.32±0.05 / 0.97±0.09 | 0.50±0.09 / 1.02±0.10 | 0.50±0.11 / 0.91±0.11 | 0.42±0.07 / 0.94±0.09 | 0.42±0.08 / 0.87±0.08 | 0.59±0.10 / 1.17±0.11 | 0.58±0.10 / 0.96±0.10 | 0.42±0.08 / 0.89±0.09 | 0.70±0.08 / 1.22±0.07 |
| Ramachandran plot¶,§ [%] | 97/3/0/0 | 97/3/0/0 | 92/7/1/0 | 85/13/1/1 | 92/8/0/0 | 94/6/0/0 | 93/7/0/0 | 90/10/0/0 | 90/10/0/0 |
| Procheck G-factor§, Φ&Ψ/All | 0.20/0.07 | 0.14/0.12 | -0.44/-0.31 | -0.31/-0.32 | -0.31/-0.20 | 0.17/0.19 | -0.09/-0.16 | -0.32/-0.32 | -0.34/-0.35 |
| MOLPROBITY clash score§ | 20.89 | 19.20 | 12.73 | 29.01 | 19.80 | 14.65 | 16.64 | 11.2 | 20.44 |
| DP score§ [%] | 72 | 78 | 80 | 70 | 85 | 81 | 80 | 76 | 71 |
| Expert time [days] | 15 | 15 | 17 | 12 | 15 | 20 | 14 | 25 | 35 |
| RMSDbb** [Å] | 0.64 | 0.57 | 0.66 | 0.70 | 0.69 | 0.60 | 2.07 | 1.10 | 2.03‡‡ |

RMSDall†† [Å]

1.29

1.14