MIT CSAIL Tomaso Poggio: The Science and Engineering of Intelligence

衡云韶

Published 2017/11/14 in the Technology category


Text content
1. The Science and the Engineering of Intelligence. Tomaso Poggio, Center for Biological and Computational Learning, McGovern Institute for Brain Research at MIT, Department of Brain & Cognitive Sciences, CSAIL, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
2. Engineering of Intelligence: recent successes
3. Intelligence: engineering
5. Recent progress in AI
8. Why now: very recent progress in AI
9. Mobileye
10. 20 years ago: MIT and Daimler
11. CBMM: motivations Key recent advances in the engineering of intelligence have their roots in basic science of the brain STC Annual Meeting, 2016
12. The same hierarchical architectures in the cortex, in models of vision and in Deep Learning networks. Desimone & Ungerleider, 1989; Van Essen & Movshon. STC Annual Meeting, 2016
13. The race for Intelligence • The science of intelligence was at the roots of today’s engineering success • …we need to make another basic effort on it: for the sake of basic science, and for the engineering of tomorrow. STC Annual Meeting, 2016
14. Science + Engineering of Intelligence Mission: We aim to make progress in understanding intelligence — that is in understanding how the brain makes the mind, how the brain works and how to build intelligent machines. CBMM’s main goal is to make progress in the science of intelligence which enables better engineering of intelligence. Third Annual NSF Site Visit, June 8 – 9, 2016
15. Interdisciplinary: Cognitive Science, Machine Learning, Computer Science, Neuroscience, Computational Neuroscience. Science + Technology of Intelligence
16. Centerness: collaborations across different disciplines and labs. MIT: Boyden, Desimone, Kaelbling, Kanwisher, Katz, Poggio, Sassanfar, Saxe, Schulz, Tenenbaum, Ullman, Wilson, Rosasco, Winston. Harvard: Blum, Kreiman, Mahadevan, Nakayama, Sompolinsky, Spelke, Valiant. Rockefeller: Freiwald. Allen Institute: Koch. UCLA: Yuille. Stanford: Goodman. Hunter: Epstein, Sakas, Chodorow. Wellesley: Hildreth, Conway, Wiest. Puerto Rico: Bykhovaskaia, Ordonez, Arce Nazario. Cornell: Hirsh. Howard: Manaye, Chouikha, Rwebargira
17. Recent Stats and Activities. IIT: Metta. A*star: Tan. Hebrew U.: Shashua. MPI: Buelthoff. Genoa U.: Verri. Weizmann: Ullman. MEXT, Japan. City U. HK: Smale. Google, IBM, DeepMind, Honda, Microsoft, Siemens, Schlumberger, GE, Boston Dynamics, Orcam, Nvidia, Rethink Robotics, MobilEye. Third CBMM Summer School, 2016
18. EAC members Pietro Perona, Caltech Charles Isbell, Jr., Georgia Tech Joel Oppenheim, NYU Lore McGovern, MIBR, MIT David Siegel, Two Sigma Demis Hassabis*, DeepMind Marc Raibert, Boston Dynamics Kobi Richter, Medinol Judith Richter, Medinol Dan Rockmore, Dartmouth Susan Whitehead, MIT Corporation Fei-Fei Li, Stanford Third CBMM Summer School, 2016
19. CBMM Brains, Minds and Machines Summer School at Woods Hole: our flagship initiative STC Annual Meeting, 2016
20. Brains, Minds and Machines Summer School. In 2016: 302 applications for 35 slots. Annual STC Meeting, 2016
21. Brains, Minds and Machines Summer School. Broad introduction to research on human and machine intelligence • computation, neuroscience, cognition • research methods and current results • lecture videos on CBMM website • summer 2015 course materials to be published on MIT OpenCourseWare. List of speakers*: Tomaso Poggio, Winrich Freiwald, Elizabeth Spelke, Ken Nakayama, Amnon Shashua, Dorin Comaniciu, Demis Hassabis, Gabriel Kreiman, Matthew Wilson, Rebecca Saxe, Patrick Winston, James DiCarlo, Tom Mitchell, Josh McDermott, Nancy Kanwisher, Boris Katz, Josh Tenenbaum, L. Mahadevan, Shimon Ullman, Laura Schulz, Lorenzo Rosasco, Ethan Meyers, Larry Abbott, Aude Oliva, Eero Simoncelli, Eddy Chang. * CBMM faculty, industrial partners. STC Annual Meeting, 2016
22. Learning by Doing: Lab Work & Joint Student Projects. STC Annual Meeting, 2016
23. An example project across thrusts: face recognition Nancy Kanwisher Third Annual NSF Site Visit, June 8 – 9, 2016
24. A project across thrusts: face recognition Winrich Freiwald and Doris Tsao Third Annual NSF Site Visit, June 8 – 9, 2016
25. A project across thrusts: face recognition. [Figure: Model, ML, AL, AM] Third Annual NSF Site Visit, June 8 – 9, 2016
26. A project across thrusts: face recognition. [Figure: Model, ML, AL, AM] Third Annual NSF Site Visit, June 8 – 9, 2016
27. Another project: When and why are deep networks better than shallow networks? Work with Hrushikesh Mhaskar; initial parts with L. Rosasco and F. Anselmi
34. Hierarchical feedforward models of the ventral stream do “work”
35. Convolutional networks. “Hubel-Wiesel” models include Hubel & Wiesel, 1959; Fukushima, 1980; Wallis & Rolls, 1997; Mel, 1997; LeCun et al., 1998; Riesenhuber & Poggio, 1999; Thorpe, 2002; Ullman et al., 2002; Wersing and Koerner, 2003; Serre et al., 2007; Freeman and Simoncelli, 2011… Riesenhuber & Poggio 1999, 2000; Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005; Serre, Oliva, Poggio 2007
36. Hierarchical feedforward models of the ventral stream do “work”
37. The same hierarchical architectures in the cortex, in the models of vision and in Deep Learning networks
40. The same hierarchical architectures in the cortex, in the models of vision and in Deep Learning networks
41. DLNNs: two main scientific questions. When and why are deep networks better than shallow networks? Why does SGD work so well for deep networks? Could unsupervised learning work as well? Work with Hrushikesh Mhaskar; initial parts with L. Rosasco and F. Anselmi
42. Classical learning algorithms: “high” sample complexity and shallow architectures. How do the learning machines described by classical learning theory (such as kernel machines) compare with brains? ❑ One of the most obvious differences is the ability of people and animals to learn from very few examples (the “poverty of the stimulus” problem). ❑ A comparison with real brains offers another, related, challenge to learning theory. Classical “learning algorithms” correspond to one-layer architectures, whereas the cortex suggests a hierarchical architecture. Thus… are hierarchical architectures with more layers the answer to the sample complexity issue? The Mathematics of Learning: Dealing with Data, Tomaso Poggio and Steve Smale, Notices of the American Mathematical Society (AMS), Vol. 50, No. 5, 537-544, 2003.
43. Deep and shallow networks: universality. $g(x) = \sum_{i=1}^{r} c_i \,(\langle w_i, x\rangle + b_i)_+$ (Cybenko, Girosi, …)
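To make the formula concrete, here is a minimal Python/NumPy sketch (not from the slides; the weights, dimensions, and function name are arbitrary placeholders) evaluating the one-hidden-layer network $g(x) = \sum_{i=1}^{r} c_i (\langle w_i, x\rangle + b_i)_+$ that the slides use as the "shallow" architecture:

```python
import numpy as np

def shallow_relu_net(x, W, b, c):
    """Evaluate g(x) = sum_i c_i * (<w_i, x> + b_i)_+  (one hidden layer of r ramp/ReLU units).

    W: (r, d) array of hidden weights w_i
    b: (r,)   array of hidden biases b_i
    c: (r,)   array of output coefficients c_i
    """
    pre = W @ x + b                  # inner products <w_i, x> + b_i
    hidden = np.maximum(pre, 0.0)    # the positive part (.)_+
    return c @ hidden

# Toy usage with random placeholder parameters: d = 4 inputs, r = 16 hidden units.
rng = np.random.default_rng(0)
d, r = 4, 16
W, b, c = rng.normal(size=(r, d)), rng.normal(size=r), rng.normal(size=r)
x = rng.normal(size=d)
print(shallow_relu_net(x, W, b, c))
```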
44. Classical learning theory and Kernel Machines (Regularization in RKHS): $\min_{f \in H} \frac{1}{\ell}\sum_{i=1}^{\ell} V(f(x_i) - y_i) + \lambda \|f\|_K^2$ implies $f(x) = \sum_{i=1}^{\ell} \alpha_i K(x, x_i)$. The equation includes splines, Radial Basis Functions and Support Vector Machines (depending on the choice of V). RKHS were explicitly introduced in learning theory by Girosi (1997) and Vapnik (1998). Moody and Darken (1989) and Broomhead and Lowe (1988) introduced RBF to learning theory. Poggio and Girosi (1989) introduced Tikhonov regularization in learning theory and worked (implicitly) with RKHS. RKHS were used earlier in approximation theory (e.g. Parzen, 1952-1970; Wahba, 1990). For a review, see Poggio and Smale, The Mathematics of Learning, Notices of the AMS, 2003.
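As an illustration of the regularization scheme above, the following sketch instantiates it as kernel ridge regression, i.e. square loss for V and a Gaussian kernel; both choices and all data are assumptions for the example, not taken from the slides. The solution has the representer form $f(x) = \sum_i \alpha_i K(x, x_i)$:

```python
import numpy as np

def gaussian_kernel(X1, X2, sigma=1.0):
    """Gram matrix K[i, j] = exp(-||x1_i - x2_j||^2 / (2 sigma^2))."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def fit_kernel_ridge(X, y, lam=0.1, sigma=1.0):
    """Minimize (1/l) sum_i (f(x_i) - y_i)^2 + lam ||f||_K^2 in the RKHS.
    Closed form for the coefficients: alpha = (K + lam * l * I)^{-1} y."""
    l = len(X)
    K = gaussian_kernel(X, X, sigma)
    return np.linalg.solve(K + lam * l * np.eye(l), y)

def predict(X_train, alpha, X_test, sigma=1.0):
    """Representer theorem: f(x) = sum_i alpha_i K(x, x_i)."""
    return gaussian_kernel(X_test, X_train, sigma) @ alpha

# Toy usage: fit a noisy 1D curve from 20 samples and evaluate at two test points.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(20, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=20)
alpha = fit_kernel_ridge(X, y)
print(predict(X, alpha, np.array([[0.0], [1.5]])))
```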
45. Classical kernel machines are equivalent to shallow networks. Kernel machines $f(x) = \sum_{i=1}^{\ell} c_i K(x, x_i) + b$ can be “written” as shallow networks: the value of $K(x, x_i)$ corresponds to the “activity” of the “unit” for the input, and the $c_i$ correspond to the “weights”.
46. Deep and shallow networks: universality. $g(x) = \sum_{i=1}^{r} c_i \,(\langle w_i, x\rangle + b_i)_+$ (Cybenko, Girosi, …)
47. Deep and shallow networks • Thus depth is not needed for approximation: $g(x) = \sum_{i=1}^{r} c_i \,(\langle w_i, x\rangle + b_i)_+$
48. Deep and shallow networks • Thus depth is not needed for approximation • Conjecture: depth may be more effective for certain classes of functions. $g(x) = \sum_{i=1}^{r} c_i \,(\langle w_i, x\rangle + b_i)_+$
49. When is deep better than shallow. Generic functions: $f(x_1, x_2, \ldots, x_8)$. Compositional functions: $f(x_1, x_2, \ldots, x_8) = g_3\big(g_{21}(g_{11}(x_1, x_2), g_{12}(x_3, x_4)),\, g_{22}(g_{11}(x_5, x_6), g_{12}(x_7, x_8))\big)$. Mhaskar, Poggio, Liao, 2016
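For concreteness, here is a small Python sketch of the 8-variable compositional example above, with placeholder two-argument functions standing in for $g_{11}, g_{12}, g_{21}, g_{22}, g_3$ (the slides do not specify them); the only point being illustrated is that each node depends on just two inputs:

```python
# Hypothetical constituent functions; the structure only requires that each
# g take two arguments, so any smooth two-variable functions would do.
def g11(a, b): return a * b
def g12(a, b): return a + b
def g21(a, b): return max(a, b)
def g22(a, b): return a - b
def g3(a, b):  return a + 2 * b

def f_compositional(x):
    """f(x1,...,x8) = g3(g21(g11(x1,x2), g12(x3,x4)),
                         g22(g11(x5,x6), g12(x7,x8)))"""
    x1, x2, x3, x4, x5, x6, x7, x8 = x
    left = g21(g11(x1, x2), g12(x3, x4))     # left half of the binary tree
    right = g22(g11(x5, x6), g12(x7, x8))    # right half of the binary tree
    return g3(left, right)                   # root node

print(f_compositional([1, 2, 3, 4, 5, 6, 7, 8]))
```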
50. When is deep better than shallow. Theorem: why and when are deep networks better than shallow networks? $f(x_1, x_2, \ldots, x_8) = g_3\big(g_{21}(g_{11}(x_1, x_2), g_{12}(x_3, x_4)),\, g_{22}(g_{11}(x_5, x_6), g_{12}(x_7, x_8))\big)$. Mhaskar, Poggio, Liao, 2016
51. When is deep better than shallow. Theorem: why and when are deep networks better than shallow networks? $f(x_1, x_2, \ldots, x_8) = g_3\big(g_{21}(g_{11}(x_1, x_2), g_{12}(x_3, x_4)),\, g_{22}(g_{11}(x_5, x_6), g_{12}(x_7, x_8))\big)$ and the shallow network $g(x) = \sum_{i=1}^{r} c_i \,(\langle w_i, x\rangle + b_i)_+$. Mhaskar, Poggio, Liao, 2016
52. When is deep better than shallow: why and when are deep networks better than shallow networks? $f(x_1, x_2, \ldots, x_8) = g_3\big(g_{21}(g_{11}(x_1, x_2), g_{12}(x_3, x_4)),\, g_{22}(g_{11}(x_5, x_6), g_{12}(x_7, x_8))\big)$. Theorem (informal statement): Suppose that a function of $d$ variables is compositional. Both shallow and deep networks can approximate $f$ equally well, but the number of parameters of the shallow network depends exponentially on $d$, as $O(\varepsilon^{-d})$, whereas for the deep network it depends linearly on $d$, that is, $O(d\,\varepsilon^{-2})$. Mhaskar, Poggio, Liao, 2016
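A rough numerical reading of the bound, purely for illustration and ignoring constants: with $d = 8$ input variables and target accuracy $\varepsilon = 0.1$, the shallow network needs on the order of $\varepsilon^{-d} = 0.1^{-8} = 10^{8}$ parameters, while the deep network exploiting the compositional structure needs on the order of $d\,\varepsilon^{-2} = 8 \times 10^{2} = 800$. This gap is the sense in which compositionality avoids the exponential dependence on $d$.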
53. Shallow vs deep networks This is the best possible estimate (n-width result) Mhaskar, Poggio, Liao, 2016
54. Similar results for VC dimension of shallow vs deep networks Poggio, Anselmi, Rosasco, 2015
55. When is deep better than shallow. Theorem: Suppose that a function of $d$ variables is compositional. Both shallow and deep networks can approximate $f$ equally well, but the number of parameters of the shallow network depends exponentially on $d$, as $O(\varepsilon^{-d})$, whereas for the deep network it depends linearly on $d$, that is, $O(d\,\varepsilon^{-2})$. New proof: linear combinations of 6 units provide an indicator function; $k$ partitions for each coordinate require $6kn$ units in one layer. The next layer computes the entries of the 2D table corresponding to $g(x_1, x_2)$; they also correspond to tensor products. Two layers with $6kn + (6kn)^2$ units represent one of the $g$ functions. For convolutional nets the total number of units is $l\,(6kn + (6kn)^2)$. Mhaskar, Poggio, Liao, 2016
56. Our theorem directly implies other known results • A classical theorem [Hastad, 1987] shows that deep circuits are more efficient in representing certain Boolean functions than shallow circuits. Hastad proved that highly-variable functions (in the sense of having high frequencies in their Fourier spectrum), in particular the parity function, cannot even be decently approximated by small constant-depth circuits • The main result of [Telgarsky, 2016, COLT] says that there are functions with many oscillations that cannot be represented by shallow networks with linear complexity but can be represented with low complexity by deep networks.
57. When is deep better than shallow. Corollary: Our main theorem implies the Hastad and Telgarsky theorems. Use our theorem with Boolean variables. Consider the parity function $x_1 x_2 \cdots x_d$, which is compositional. Q.E.D. For the second part, consider for instance the real-valued polynomial $x_1 x_2 \cdots x_d$ defined on the cube $(-1, 1)^d$; this is a compositional function that changes sign a lot. Q.E.D. Mhaskar, Poggio, Liao, 2016
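A small sketch of the parity part of the corollary (Python; $d = 8$ and the function names are hypothetical choices for the example): parity can be written as a balanced binary tree of two-input XOR nodes, i.e. exactly a compositional function with two-variable constituents, which is what makes the theorem applicable:

```python
from functools import reduce

def xor(a, b):
    """Two-input constituent node: parity of a pair of bits."""
    return a ^ b

def parity_tree(bits):
    """Parity(x1, ..., xd) computed as a balanced binary tree of XOR nodes.
    Assumes d is a power of two (here d = 8)."""
    layer = list(bits)
    while len(layer) > 1:
        layer = [xor(layer[i], layer[i + 1]) for i in range(0, len(layer), 2)]
    return layer[0]

bits = [1, 0, 1, 1, 0, 1, 0, 0]                 # hypothetical d = 8 input
assert parity_tree(bits) == reduce(xor, bits)   # agrees with the flat definition
print(parity_tree(bits))
```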
58. The curse of dimensionality, the blessing of compositionality
59. The curse of dimensionality, the blessing of compositionality For compositional functions deep networks — but not shallow ones — can avoid the curse of dimensionality, that is the exponential dependence on the dimension of the network complexity and of its sample complexity.
60. Why are compositional functions important? They seem to occur in computations on text, speech, images… why? Conjecture (with Max Tegmark): the Hamiltonians of physics induce compositionality in natural signals such as images.
64. Remarks 1. A binary tree net is a good proxy for ResNets 2. Scalable algorithms and compositional functions 4. Invariance and pooling 6. Sparse functions and Boolean functions
65. Convolutional Deep Networks (no pooling, as in ResNets). [Figure: two networks over inputs x1 … x8] Similar theorems apply to the network on the left and the network on the right in terms of the number of parameters.
66. Hyper deep residual networks: a binary tree net is a good mathematical proxy. [Figure: binary tree network over inputs x1 … x8]
67. Remarks 1. A binary tree net is a good proxy for ResNets 2. Scalable algorithms and compositional functions 4. Invariance and pooling 6. Sparse functions and Boolean functions
68. Shift-invariant, scalable algorithms Mhaskar, Poggio, Liao, 2016
69. Qualitative arguments for compositional functions in vision • Images require algorithms of the compositional function type • Recognition in clutter requires computations with compositional functions
70. Remarks 1. A binary tree net is a good proxy for ResNets 2. Scalable algorithms and compositional functions 4. Invariance and pooling: interpretation of nodes in binary tree 6. Sparse functions and Boolean functions
71. Comment on i-theory • i-theory is not essential for today’s theorem; it represents a further analysis of convolutional networks and extensions of them • i-theory characterizes how convolution and pooling in multilayer networks reduce sample complexity (→ Lorenzo) • Theorems about extending invariance beyond position invariance and how to learn it from the environment (→ Lorenzo). Anselmi and Poggio, 2016, MIT Press
72. Remarks 1. A binary tree net is a good proxy for ResNets 2. Scalable algorithms and compositional functions 4. Invariance and pooling 6. Sparse functions and Boolean functions
73. Sparse functions Mhaskar, Poggio, Liao, 2016
74. More remarks • Functions that are not compositional/sparse may not be learnable by deep networks • Deep, non-convolutional, densely connected networks are not better than shallow networks; DCLNs can be much better (for compositional functions) but not for all functions/computations • Binarization leads to considering sparse Boolean functions
75. DLNNs: two main scientific questions When and why are deep networks better than shallow networks? Why does SGD work so well for deep networks?
76. Parenthetical comment on i-theory • Convolution and pooling in multilayer networks reduce sample complexity • Theorems about extending invariance beyond position invariance and how to learn it from the environment. Anselmi and Poggio, 2016, MIT Press