What neural activity encodes about stimuli, where and when


[ Silence ]>>So, what I'd like to talk about is not visual perception, but something very related to it, because, of course, visual perception, which many people here are studying, ends with figuring out the meaning of the thing that we're looking at and also representing that. So, along with a number of people, we've been looking for the last several years at the question of how neural activity encodes the meanings of words, or, let's say, word meanings. And, when I say a number of people, I mean a collection of people. And, I put some red boxes around the people who are in this very room. So, Gus and Lela and Alana are here and are largely responsible for the work here. So, as I mentioned, we're
interested in the question of how neural activity
codes word meaning. And so, we’ve been doing a
number of brain imaging studies with primarily FMRI
but more recently MEG where we present these
kinds of stimuli to people, sometimes line drawings, sometimes written words, sometimes line drawings with written words under them, they're always consistent, and sometimes spoken words. And, in an FMRI scanner then, if we show somebody, let's say, a stimulus like this, a line drawing with the word bottle
written under it, then, we observe the FMRI
activation like this. These are four slices of a
three dimensional FMRI image. Posterior is up so most
of that red activation is in visual areas. And, if you ask me, well, what's the difference between what we see when we present bottle and your average word? So, here's the mean activation that we observed across 60 stimuli, which looks a lot like the activation for bottle. But, there is a difference. And so, in fact, you can see the residual here: if we subtract the mean activation out of what we see for bottle, there
is some difference. And, many people have used FMRI with various kinds of classifiers to show that, indeed, FMRI activation does allow us to distinguish different categories of pictures and words and so forth. So, we're particularly interested in coming up with some kind of computational model that would require us to understand something about the structure of these representations. And so, for example, imagine a computational model where we could give, as input, an arbitrary word, actually for now an arbitrary concrete noun in English like celery. And then, if the model
really captured something about how these neural
representations work, it would be able to
predict the FMRI activation when we show somebody that word. So, we're interested in that, and the approach that we're taking actually ties into the theme of latent representations, abstract representations, that Jim talked about and Jeff talked about and a number of others. And, the model that we developed
first, well, if you think about trying to build such a computational model, really, the first question you have is: how would I represent the meaning of an arbitrary word like celery in a way that would be computer-manipulable, so that we could train something to predict the FMRI activation? The first approach that we
took was to steal an idea from computational linguistics
which is that you can represent, in a computer, an approximation
to the meaning of a word by capturing statistics
of how that word is used in a very large collection
of text. So, our first model, in fact, used statistics from a trillion word
corpus of text, which Google collected
from the web. And, given an arbitrary
word like celery, it would represent celery as
a collection of 25 features, each feature corresponding
to some verb. You see here, for example, the verb eat co-occurs a lot with celery. Taste occurs a fair number of times. Ride occurs very rarely with celery.
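As a toy illustration of that representation, here is a minimal sketch in Python. The verb subset and the counts below are made up, and the normalization is just one reasonable choice, not necessarily the one used in the actual work.

```python
import numpy as np

# A few of the 25 sensory-motor verbs (illustrative subset, not the real list).
verbs = ["eat", "taste", "ride", "push", "see"]

# Hypothetical raw co-occurrence counts of each noun with each verb,
# as might be gathered from a very large text corpus.
counts = {
    "celery":   np.array([900.0, 220.0, 2.0, 15.0, 180.0]),
    "airplane": np.array([12.0, 3.0, 650.0, 40.0, 700.0]),
}

# Normalize each noun's count vector to unit length so that only the
# relative usage profile matters, not the word's overall corpus frequency.
features = {noun: c / np.linalg.norm(c) for noun, c in counts.items()}
print(features["celery"])
```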
So, we had a collection of 25 verbs which, well, to be honest, I made up. But, they involved a lot of sensory motor type verbs. And, we can see for
a different word like airplane we get a very
different profile of statistics. And, you’d have to agree with me that these statistics
capture some aspects of the meaning of the word. Okay so, given that, our model then could be trained. We use those features, those co-occurrence statistics, as our representation of the stimulus. And then, we can train a regression model to predict what the FMRI activation would be for a word like celery, by taking those co-occurrence statistics for eat and the other verbs and training essentially a voxel by voxel regression model to predict, at each location in the brain, what the activation would be as a function of how frequently our input noun stimulus co-occurs with each of these 25 verbs. And, you can see that if we had some training data, a collection of nouns and known FMRI images for those nouns, then we could look up these verb co-occurrences as properties of the stimulus, train this large regression model, and, indeed, that's what we did. So, in the end, the prediction for each voxel v is just the sum, over the 25 features, of the frequency with which the stimulus word co-occurs with verb i, eat, taste, etcetera, times some coefficient that tells how much that particular feature contributes to voxel v, okay, so standard linear regression. There are a lot of parameters in this model. We're not afraid of training models with more parameters than we have training examples. One thing that makes this work well is that, if you think about it, there are 20,000 voxels here, approximately. So, we're training 20,000 different independent linear regressions. So, the number of parameters per linear regression is much smaller. And, anyway, between that and some significant regularization, we can make this work. So, don't believe anybody who tells you that you can't train useful models if you don't have more data points than you have features.
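To make the setup concrete, here is a minimal sketch of that kind of voxel-wise regularized regression. The array names, the shapes, and the use of scikit-learn's Ridge with a single alpha are my own illustration of the idea, not the exact pipeline from the study.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Hypothetical shapes: 58 training nouns, 25 verb co-occurrence features
# per noun, and roughly 20,000 voxels per fMRI image.
n_nouns, n_features, n_voxels = 58, 25, 20000
rng = np.random.default_rng(0)
verb_features = rng.random((n_nouns, n_features))        # co-occurrence stats per noun
train_images = rng.standard_normal((n_nouns, n_voxels))  # observed fMRI image per noun

# One regularized linear regression per voxel; with a multi-output target,
# Ridge fits all 20,000 per-voxel models in a single call.
model = Ridge(alpha=1.0)   # alpha controls the amount of regularization
model.fit(verb_features, train_images)

# Predict the image for a new noun, e.g. "celery", from its 25 features.
celery_features = rng.random((1, n_features))
predicted_image = model.predict(celery_features)   # shape (1, n_voxels)
```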
Okay so now, does this work? In fact, if we train on a collection of other words and then present to the model the novel words celery and airplane, you see the predicted images and the observed images. You see they're not perfectly correct predictions, but they capture some of the structure that's going on here. And, by the way, this blue streak is basically the foreign region. Okay, so that's one way I can
show you what the model does. Another way is more
quantitatively, we can just measure how accurate
would this model be if we were to give it pairs of words
that it was not trained on. Say, we didn’t train it
on celery and airplane. And, we give it the pair of FMRI
images that go with those two that come from those two words. And then, we ask it
which one is celery and which one is airplane. So, if it’s guessing, at
chance, it’ll get half of those correct by chance. Given the number of
examples we have, to be statistically
significantly above chance, we need a little
bit better than .5. And, in fact, when we try this,
train the model independently on nine different
subjects in the experiment, then it does significantly
better than chance. So, three times out of
four, if you give it a pair of words it hasn’t seen,
it can match that up with two FMRI images
it hasn't seen. Okay, so that's a quantitative way of measuring how good a model is.
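As a concrete illustration, here is a small sketch of that leave-two-out matching test, assuming we already have predicted and observed images for the two held-out words. The correlation-based matching rule is one common choice, not necessarily the exact similarity measure used.

```python
import numpy as np

def match_pair(pred_a, pred_b, obs_1, obs_2):
    """Decide which observed image goes with which predicted image.

    Returns True if assigning obs_1 -> pred_a and obs_2 -> pred_b gives a
    higher total correlation than the swapped assignment.
    """
    def corr(x, y):
        return np.corrcoef(x, y)[0, 1]

    straight = corr(pred_a, obs_1) + corr(pred_b, obs_2)
    swapped = corr(pred_a, obs_2) + corr(pred_b, obs_1)
    return straight > swapped

# Hypothetical usage: pred_celery, pred_airplane come from the trained model,
# obs_celery, obs_airplane are the held-out fMRI images for those two words.
# correct = match_pair(pred_celery, pred_airplane, obs_celery, obs_airplane)
# Accuracy is the fraction of held-out word pairs matched correctly; chance is 0.5.
```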
Okay so, another way of looking at this model and seeing how much we like it is to ask: what are the loadings, what are those learned coefficient sets, for each of those 25 verbs? So, for example, let's take the verb eat. These are the learned
coefficients from the training data. And remember, we never gave
a verb as the stimulus. We never gave eat as a stimulus in the training,
we only gave nouns. The training has indirectly solved for this collection of coefficients, which basically represent the activation that would be contributed to the final prediction as a function of how frequently the noun stimulus co-occurs with eat. And so, what this is telling us is, if we have a noun that occurs a lot with the verb eat, we'll see a lot of activation here, which some people had previously referred to as gustatory cortex, involved in the sense of taste. Another verb, push, gives us a collection of activation in [inaudible] sensory regions. So, for about half, not all, of those 25 verbs in the model, if you look through the
learned coefficients, you see some interesting
correspondence between what we think is the meaning of the verb and the activations being predicted, in cortical regions associated with that action. So, this is very consistent with what a number of other people have pointed out, that much of the neural representation we see for word meanings is grounded in sensory motor regions. So, this is just another
way of observing that. Okay so, so there’s the model. And, one thing I
want to point out is that this is pretty much
analogous to the story that you heard earlier
from Jack. So, Jack had a similar
kind of story to tell us, where he actually said: I have some stimuli, a movie, a time series of images; I'd like to predict from that sequence of images, and I'm going to come up with some abstraction that's going to mediate between the stimuli and the observed FMRI data so that I can train a model. And, just like us, he came
up with a set of features that represent the
stimulus that are computable from the stimulus, in his case with a visual processing
algorithm and from some [inaudible], and then create edges,
regions and so forth. And then, given that
essentially feature vector, it wasn’t our 25 verbs, it was
a much more subtle rich set of features, we also heard that
he used large scale regression to make that prediction. So, I just wanted to kind of
highlight this correspondence between some of the other
things we’ve been hearing here. So, back to the main
story though. So, here's our model. Now, the first thing you might ask about this model is: why those 25 verbs? If you think about it, what this model's essentially saying is we can predict neural activity when somebody thinks about a word based on how it's used on the web, and, in particular, in terms of co-occurrence with these specific 25 verbs. And, that can't possibly be the optimal model for predicting neural representations. So, we got interested in
what would be a better model or what would be a better
set of semantic features to characterize this
[inaudible]. So, we started exploring
that in a number of ways. We had a lot of friends who
had suggestions about what kind of features we should
use instead. Try more words. Try all the words in English. So, we tried a number
of different things. But, it didn’t actually
improve our cross validated model accuracy. And then, one day, Dean
[Inaudible] who is part of this research project came in and said I found
something that works better. And, what he had done
is come up with a set of 218 questions very similar
to the ones that you would think of if you were playing
20 questions. And, he had gone to Amazon's Mechanical Turk service. He had gotten people there to answer these 218 questions for a thousand common nouns in English. And, when he used the answers to these questions, obtained behaviorally from Mechanical Turk workers,
he finally came up with a model that beat our corpus statistics. So, in some sense this is a better, more accurate basis set of features for mediating between words and their neural representations observable by FMRI. Okay then, fortunately, [Inaudible], one of our PhD students, came up with an even better model. And, it was actually based on these same Mechanical Turk-derived features. But, [Inaudible] said, well,
this is fundamentally a machine learning problem. What we want to do is we want to
find the definition of this set of features that will give
us the most accurate system. And, he came up with the following model, which I want to show you. So, unlike the first model,
this model is trained using data from multiple subjects not
on a subject specific basis. And, also unlike the model
that I talked about before, the overall model includes both
subject specific parameters that are being estimated
differently for each individual person and then subject
independent parameters that are also being estimated. So, I'll show you the model. But, basically, he comes up with the set of features that are going to mediate between the stimulus features and the data in a three step algorithm. So, in step one, he uses canonical correlation analysis, which is a technique that, like PCA, or like what Jim talked about earlier, can give you abstractions of data. In particular, what CCA does is it finds a collection of features. The first one it finds
works like this: it learns a linear mapping, a linear function, for each dataset that defines the feature as a linear combination of the voxel intensities. It learns one mapping for each dataset, and it learns that collection of linear mappings specifically to have the following property: the projected points, according to those linear mappings, have the maximum correlation possible among linear mappings. So, basically, it's looking
for linear mappings from each of these 20 different
datasets to points that will be maximally
correlated across the 60 stimulus
words that we were using. And then, given that, it
finds a second component which is also the next most
correlated feature subject to being uncorrelated
with the first feature. And, it goes on. So, you can think of it as
filling the same kind of role that Jim was talking about this
morning when he was telling us about how to combine data across multiple subjects
using rotations, parameterized rotations, for
each different dataset plus PCA. And, those two things done in sequence, he showed us, can also give a lower dimensional representation of the data, along with essentially a linear function, the composition of two linear functions, to get to that abstraction. So, I think, part of the theme that's emerging from this meeting is a lot of interest and ideas about the importance of coming up with subject independent representations of data that take into account the kinds of variation that we see across the stimuli that are being used. Okay so, back to
[Inaudible]'s approach. So, [Inaudible] does that; he happens to use CCA instead of the approach we've heard about before. But then, in a second step,
he links up these features that were derived from the
FMRI activations to the stimuli. How? Well, we have these 218 Mechanical Turk features. So, let's just learn a linear regression to predict each of these factor loadings, or sorry, each of these feature values, for each of the subjects. And so, this ends up being, these parameters essentially
say we can predict each of these features which
are subject independent. These parameters are
subject independent. And then, given those subject
independent characterizations of, well they came
out of the FMRI data. Didn’t they? But now, they’re being actually
predicted from the stimuli. Then, we can just go ahead, oh, we have to invert this arrow, which is the third step of the algorithm. So, it just inverts these matrices that we got out of CCA. You have to use a pseudo-inverse because there are many possible inverses. And so, we regularize and pick the one that essentially minimizes the weight norms. So now, we end up with
this feed forward model that does what we want. And, it starts with the stimulus, codes it by a lookup of these behaviorally obtained semantic features, goes through this intermediate abstraction, which is subject independent, and then maps in a subject specific way to each dataset. And this is the model that we currently have that gives us the best results. This got us up to 86% on that task of distinguishing two novel words and matching them up with the novel images.
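Here is a rough two-subject sketch of that three-step idea, just to show the shape of the computation. scikit-learn's CCA, a Ridge regression, and a plain pseudo-inverse are illustrative stand-ins; the real work combined around twenty datasets and made its own regularization choices.

```python
import numpy as np
from sklearn.cross_decomposition import CCA
from sklearn.linear_model import Ridge

# Hypothetical shapes: 60 stimulus words, two subjects' fMRI data
# (the real work combined ~20 datasets), and 218 semantic features.
n_words, n_vox, n_sem, n_comp = 60, 100, 218, 10
rng = np.random.default_rng(0)
fmri_s1 = rng.standard_normal((n_words, n_vox))   # subject 1, one image per word
fmri_s2 = rng.standard_normal((n_words, n_vox))   # subject 2
semantic = rng.random((n_words, n_sem))           # Mechanical Turk answers per word

# Step 1: CCA learns subject-specific linear mappings into a shared,
# maximally correlated component space (the subject-independent features).
cca = CCA(n_components=n_comp, scale=False, max_iter=1000)
shared_s1, shared_s2 = cca.fit_transform(fmri_s1, fmri_s2)

# Step 2: regress from the semantic features to the shared components
# (using subject 1's shared scores as the target here, for simplicity).
to_shared = Ridge(alpha=1.0).fit(semantic, shared_s1)

# Step 3: invert the CCA mapping with a pseudo-inverse (many inverses exist)
# to go from the shared components back to a subject's voxel space.
back_to_s1 = np.linalg.pinv(cca.x_rotations_)      # shape (n_comp, n_vox)

# Full feed-forward model: semantic features -> shared space -> subject 1 voxels.
new_word_semantics = rng.random((1, n_sem))
pred_fmri_s1 = to_shared.predict(new_word_semantics) @ back_to_s1 + fmri_s1.mean(axis=0)
```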
And so, well, let me just ask the same question that I think I asked one of the speakers: what are the meanings of these
features that get discovered? So, we can look at that
in, of course, two ways. One, just as Jack was talking
about earlier, we could ask for, let’s say, the first
CCA component, what are the stimulus
words that most excite that CCA component according
to our learned regression? And so, for component
one, out of the 60 words, here are the top five that
most excite component one, the stimuli. Here are the ones for
component two and so forth. I’m not sure how to name them. But, maybe, I would
call this one shelter. These are things that I
can go inside and hide. Maybe I will call this one
stuff I can manipulate. I don’t know what to
call the third one, weird collection of stuff. Okay but, that’s one way of
looking at these components. Another is to ask well what
are the feature loadings? Just as Jim was showing
us in his talk. So, here is a three dimensional FMRI image of the loadings for that first CCA component. And, it was on the slide, but I didn't point this out; it's an important detail: there were 20 datasets that we were working with at the time. Half of those datasets involved people getting word-only stimuli, printed text words. The other half of the datasets
involve people getting line drawings with words
underneath them. One of the subjects
was a subject in both of those experiments. So, what I want to show
you is the loadings for this factor for
that subject. And, as it sets at
the top of the slide, this is for the word
only stimuli session that he, the subject provided. And again, this is a
posterior interior. And so, these are the loadings for that shelter
style component. When the person was involved, it
saw the stimulus only as a word. Now, let me show you what
happens when we switch to the other dataset where
he saw the stimulus as a word with a line drawing
accompanying it. So, what do you think happens? That’s what happens. So, we get a lot more
activation in posterior regions. Probably that’s not a big
surprise to this crowd. But, if you look at what happens
when they toggle back and forth, except for those
posterior regions the neural representations or the codes
are actually quite similar across the two. So, it’s sort of like it didn’t
matter whether it was a word or a word accompanied by
a line drawing as long as you’re only looking
from here forward. If you're looking behind there, then it actually mattered. That's an oversimplification, but if you look at this for a while you'll see that it's remarkably similar. So, this model, as we mentioned, uses a collection of datasets to learn an abstraction, and then, in a separate stage, learns a mapping from our stimulus to those abstractions so that we can make it into a full model. We did this again
a different way. Instead of using CCA we
used factor analysis. And, I want to show
you the results of that because we also tried
an interesting little experiment there. Remember when I said I
don’t know quite what to call this component
but maybe it’s shelter? Well, we did this again
with factor analysis. We only used the words
only stimulus data. So, there were no line drawings or pictures involved
in this stimulus set. And, when we did it, we got the
following four top components that came out of
the factor analysis. And, each of these components
is shown here in terms of what regions in the brain
it corresponded to and also, as before, which were the
words that most, stimulus words that most excite that
component in the final model. And so, again, we have this one
that you might call shelter, you might call manipulation,
you might call eating and that we’re pretty sure is
word length, this fourth one. And this model works pretty well. But, the interesting thing is this number, the 80% accuracy number, comes from the following way of testing the model. We give it two novel words; we have the FMRI data for them, but we withheld it during training. Instead of predicting with
linear regression the values of these loadings, we actually
just behaviorally asked people, an independent collection
of people, to go through the 60 words and
label them on a scale from one to seven according to how much
they made them think of shelter, manipulation, and eating. And, we used those
behavioral ratings instead of regression predictions. Why? So that we could test
whether our intuitions about the meanings of
these components jive with what people think of when
they behaviorally just look at the words and think about well is this
about eating or not. And, in fact, those behavioral
ratings are shown here on the vertical axis for
each of the 60 words. And, the factor scores
assigned by our model are shown on the horizontal axis. And so, you can see, for
each of them, a correlation between the model derived factor
scores for these components and the behaviorally
obtained assessments about how those 60
different words, how much they make you think
about manipulation for example. And here, you can also
see that fourth component. We didn’t ask people how
much it made them think about word length. But, this was literally the number of characters in the words, which correlates extremely well with the fourth component.
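A check like that is easy to express; here is a minimal sketch, assuming we have one model-derived factor score and one mean behavioral rating per word for a component such as "eating" (the arrays below are placeholders).

```python
import numpy as np

# Placeholder inputs: one value per stimulus word (60 words).
factor_scores = np.random.default_rng(0).random(60)       # from the fitted factor model
behavioral_ratings = np.random.default_rng(1).random(60)  # mean 1-7 ratings from raters

# Pearson correlation between the model's factor scores and the
# behaviorally obtained ratings for the same 60 words.
r = np.corrcoef(factor_scores, behavioral_ratings)[0, 1]
print(f"correlation between factor scores and ratings: {r:.2f}")
```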
So, one possible conjecture that one could make based on this is that, for these kinds of concrete nouns, perhaps some of the dominant factors that contribute to the neural representations for those 60 nouns correspond to these kinds of semantic features. All right, so, now I
want to switch a little. I mentioned at the beginning
that we’ve been looking at this question with both
FMRI and MEG more recently. So, I want to point out
that the wonderful thing about MEG is that it’s fast. We can get an image
per millisecond. We don't have to worry about the BOLD response. The bad thing about
MEG is instead of getting millimeter
size resolution, you get golf ball
sized resolution. So, I guess you have to give
up something to get something. So, we’ve been collecting
data like the following. I want to show you a movie of the 550 milliseconds
following the point at which this person
observed the text H A N D with a line drawing
of the hand above it. So, here’s what happens
in this person’s brain. And, over here, you
see the time clock. So, we’re starting at 20
milliseconds pre-stimulus. And here it goes. So this is the kind
of data we get from MEG while somebody
is comprehending a word which takes about half a second. [ Silence ] So you can see there’s a lot of
different stuff going on here. And, with FMRI, we’re
not able to see any of that sequential
structure in what’s going on. So, when you look at that,
you’re kind of struck that well maybe these
different features are appearing at different times. Maybe they’re being
coded at different times. Where and when, in this 500 milliseconds, are those features actually being coded by the neural activity? That's a question we
got interested in. So, here is a plot, again,
of the first, up to 6/10ths of a second, so 600
milliseconds. And the different curves show in
each of these selected regions, mostly parietal and occipital
and more superior regions. The color of the border of the region here corresponds to the color of that plot. And that plot is showing
how much activation there is from MEG in those regions. But, the interesting
question, of course, is what information is being
coded when in those regions? So, what we’d really like
is a plot here across time of how accurately can we
decode from this region as a function of time, say, the word identity. And so, that plot looks like this. And, this 90% is what we get if we use the entire brain. This black line is what we get if we use the union of these particular regions. But, the important thing to point out here is that the decodability of the word semantics peaks at around 400 milliseconds, even though the magnitude of the activation has died down quite a bit by then. In fact, if you look at the movie, you see that most of the MEG activity was well before 400 milliseconds. So, it's not the case that where
we see the greatest activation is where we’re seeing
the information coded, it’s quite the opposite. So, here’s another
way of looking at it. What I wanted to show you is
we used these 218 mechanical Turk features. We added in another 11
perceptual type features because we were using
line drawings with words. So, we made up just a simple set of perceptual features, like: is this line drawing vertically oriented, or is it diagonal left or right? Is there a lot of internal structure in the line drawing, or is it pretty much just simple? Things like that, so, very simple perceptual features, which gave us the 218 plus these 11, so 229 features. Then, we found that these were the regions, the cortical regions, where a subset of those 229 features were most decodable.
And then, we trained a classifier on 50 millisecond windows moving forward in time. And so, what I'm going to show you is which features are decoded when and where, in each of these regions, during that 500 milliseconds.
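To give a sense of what that looks like computationally, here is a minimal sketch of sliding-window decoding, assuming MEG data arranged as trials by sensors by time samples and one binary feature label per trial. The logistic-regression classifier, the cross-validation scheme, and the window step are illustrative choices, not the actual analysis.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Hypothetical data: 120 trials, 100 sensors/sources in one region,
# 550 time samples at 1 kHz (i.e., 550 ms after stimulus onset).
n_trials, n_sensors, n_times = 120, 100, 550
rng = np.random.default_rng(0)
meg = rng.standard_normal((n_trials, n_sensors, n_times))
labels = rng.integers(0, 2, n_trials)   # e.g., "is it alive?" answer for each trial's word

window, step = 50, 50                   # 50 ms windows moving forward in time
scores = []
for start in range(0, n_times - window + 1, step):
    # Flatten the sensors-by-time window into one feature vector per trial.
    X = meg[:, :, start:start + window].reshape(n_trials, -1)
    clf = LogisticRegression(max_iter=1000)
    acc = cross_val_score(clf, X, labels, cv=5).mean()
    scores.append((start, acc))         # decoding accuracy for this window

for start, acc in scores:
    print(f"{start:3d}-{start + window:3d} ms: accuracy {acc:.2f}")
```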
So, at 15 milliseconds the answer is none of them. And, at 100 milliseconds, these are the features. So, in occipital cortex, the program can decode word length, right-diagonality, verticality, and in a couple of other regions as well. At 150, it's this. By the way,
I’m using lower case to refer to the perceptual features
and uppercase to refer to what we consider
semantic features. So, so far it's all percepts. Oops, 200, so here's is it hairy, our first semantic feature. Is it hollow? Is it an animal? Is it made of wood? [ Silence ] Yellow isn't the best choice. The yellow one says is it
bigger than a car on top and at the bottom it
says is it manmade, was it ever alive,
is it manufactured. I like does it have feelings. [ Silence ] So, that’s it. That’s the sequence
of information out of those 229 features that are best decodable
during the 550 milliseconds that it takes you to read
and understand a word. So, and that’s only a
glimpse of it, all right, we had to have data for
229 features across time. This is, sort of those are the
most decodable of the features in each of those regions. But, you see that the point is
that now we can start looking at the evolution over
time of the appearance of these decoded features. Here’s another way
of looking at that. What I was showing you in the previous animation was
the most decodable features of every window. But, what if I just take
one of the features, let’s say word length on top or
alive on bottom, and, I asked, at the 50 millisecond window
that begins at 100 millisecond and goes to 150 where, where are
the weights that are used to try to decode word length? You see they’re very
[inaudible]. And, as we move forward in
time it basically just gives up because you can’t
decode it anymore. Whereas, for alive,
it's interesting. It starts out posterior, but as we move forward in time it becomes more anterior. So, you can see that
even the coding of that information is
kind of crawling around. Okay so, another way of kind of
summarizing the result is ask, out of those 218
semantic features, which ones were most decodable
at any time at any place across the whole brain? And here’s the list. And, interestingly, if
you look at that list, you notice a number of those top features are really about size. Now, not the size of the line drawing; the line drawings covered the same angular distance for everybody. This is the real object size. The bee that we presented was the same size as the house we presented, in terms of the line drawing. But, this is the actual size. That's one theme you see. Another is manipulability. [Inaudible], is it alive? And maybe shelter,
can it keep you dry? Is it clothing? Does it open and is it
hollow might be related to shelter as well. So, again, we’re seeing these
kinds of themes emerge in terms of the factors or the components
of meanings that are showing up in these decodings. Okay so, if you ask
me at this point if you could only have
five features to use to decode word semantics,
what would they be? If I had to guess and
they were concrete nouns like the ones we’ve been working
with I would say these five. But, I might be wrong. Based on the evidence we’ve
had I think that’s our best hypothesis. But, in fact, in terms of
modeling, and the models that we’re studying
now, we just want to use as many features as we can. There’s no harm in being
over complete in terms of your characterization
of your semantics or your percepts
in the stimulus. So, that’s, so it’s a
very much open question and that’s our partial result. So, I want to end by just
mentioning, sort of coming back to the top level and saying, if you think not just about this talk but a number of the talks that we've heard here today and yesterday, about where we can, where the field can move forward: I think this idea of using algorithms to devise intermediate features
that can be related both to the stimuli that we’re using and to the observed
neural activities so they become the mediators
for mapping back and forth from stimuli to the neural
activation is actually a really interesting and important idea. And, it's just that, really, five years ago we didn't have that idea. And now, it's starting to emerge that there are several ways that people are doing this. I think it's important. And, one of the questions
I hope we get to in the discussion period is so what will we use
these encodings for beyond what we’re
already doing? And, I want to end by
suggesting one thing that we might do
is we might start to build computational models that operate off those learned encodings or those intermediate encodings. So, for example, one
question we’re interested in language processing is
if I say to you bear, grr, bear and get your FMRI
activation, then if I say to you hungry bear,
what’s the relationship between the neural coding
for hungry, for bear, and for hungry bear the phrase? How does your brain
compose the meaning of these multiple
symbol phrases? And, you know, using our initial
model we might give the word bear, get some semantic
feature coding of it, predict the neural activity so we have some established
correspondence for one of those words in the phrase. But, if we have two words,
we want to build a model in the future where we give
it an adjective and a noun. Well, we know how to represent
somehow what the noun means and we know how to represent
say the co-occurrence statistics for the adjective too. And, if we had a coding of that
for the phrase then we know how to predict it, because we
can learn these coefficients from these two vectors. So, a model for predicting
adjective noun neural representations for arbitrary
adjective noun phrases is going to need this missing component. One way to do it would be to come up with some function that is our hypothesis about how the codes for these two words get composed into the code for the phrase. And, one of our students, Kimen Change [assumed spelling], did a first step of that, which he published at a computational linguistics conference; he took the simplest possible approach. He looked in the literature, the psycholinguistics literature, and two competing models were that this vector should be the conjunction of the features here or the disjunction of them. And so, he tried those out
with FMRI data for the phrases. And so, here you see
the verb frequencies for soft and for bear. Here you see the codes
for the noun phrase. You can just take the adjective, just take the noun, take the sum of those for soft and bear, or multiply them element by element. And, in his model, he found the multiplicative combination, the AND-like conjunction of those features, gave a better result than the other three models.
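As a sketch of those candidate composition functions, assuming each word is represented by the same kind of feature vector used elsewhere in the talk (the vectors below are made up for illustration):

```python
import numpy as np

# Hypothetical feature vectors for "soft" and "bear" (e.g., verb
# co-occurrence features or semantic-question features per word).
soft = np.array([0.8, 0.1, 0.6, 0.0, 0.3])
bear = np.array([0.4, 0.9, 0.5, 0.2, 0.0])

# Candidate composition functions for the phrase "soft bear":
adjective_only = soft            # use the adjective's code
noun_only = bear                 # use the noun's code
additive = soft + bear           # OR-like / disjunctive combination
multiplicative = soft * bear     # AND-like / conjunctive combination

# Each candidate phrase vector can then be fed through the trained
# feature-to-fMRI regression and scored against the observed phrase images.
```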
So, that's not to say that he solved the problem of how our brains put together words. But, I want to put this thought
to kind of open up our thinking about what we might do now
that we have algorithms for devising cross-subject abstractions of the FMRI activity or the stimulus properties; these really could become the grist for a lot of interesting cognitive computational models. And, I think this suggests the kind of approach one might take. If I had to do this again, I'd
use machine learning to learn that function instead of giving
it two choices and stopping. So, I’ll just leave this
up because I don’t want to, I would love to set
the precedent of not hearing that second bell. So, I’ll stop here and just
leave up a list of things of ideas about how I might
use those representations going forward. Thanks. [ Applause ]>>So, on the MEG data, I’m
back here, so, on the MEG data where we were decoding
the 20 questions words, is it vague is it, you know,
can I hold it in my hand, it was interesting because I
would’ve, intuitively, you know, just in the face of no, of no
information, I would’ve guessed that there would’ve been some
kind of systematic pattern in time like things that you
could hold would occur earlier than like is it hairy,
I don’t know. And, it seemed like
it was kind of random. Is there anything
systematic in the kinds of things you can
decode over time?>>Nothing that we believe so
strongly that we'd publish it. But, it does seem that there are a lot of features associated with animacy there early. Like is it hairy
might be one of those. And, you know, several of the
talks have led to a follow up discussion about aren’t
your conclusions very sensitive to a particular set of
stimuli that you’re turning on? And the answer, in our
case, is yes they are. And so, it may be that is
hairy is just a synonym for is it alive relative to
the stimulus that we’re using. But, if we find anything,
I would say it’s that maybe anomicy is early. Gus, would you have
anything to add to that? And we were surprised,
when we started this, we had little discussions where
we were hoping expecting that, you know, there’d be these
kind of semantic features that would lead to that
that would lead to these. And, what we found was not
that really it’s just kind of poof they’re all there.>>Because it also seemed like the spatial
distribution was kind of random. It wasn’t, another
way you could think of this happening is
this is the semantic, abstract semantic features
that you can decode in the back of the head would decode
first and then ones that require more anterior
computation would occur later. But, that also didn’t, I didn’t
see that in the [inaudible].>>There were quite a
few features in there that appeared first
in posterior regions and then subsequently
in more anterior. So like, well actually
it started earlier. Word length, for
example, starts there, I guess it’s pretty much there.>>All right, I was thinking
about the abstract features.>>Yeah [inaudible]
that kind of made sense.>>Yeah, this one’s
kind of interesting. [ Silence ]>>I’d like to take issue with
the claim that we don’t have to worry if we have more
degrees of freedom in the model than we do data, because
we absolutely have to worry because you said
something at the end where I think you put a copy
out which is that you have to put a significant
regularization on the models. And, that’s effectively
reducing the degrees of freedom in the model. So, I think, in fairness, we
do have to worry about it.>>That’s fair. I, you know, I’d worry about
everything to different degrees. But, fitting models with more
parameters than data is not in my top ten list of worries.>>Okay, it’s in my top ten which is why I was
taking it up with you.>>Okay. [ Inaudible ]>>Well, if you care just about accuracy, accuracy to classify, you would only care, right, as long as it works. If you want to construct a model, you should analyze deep inside what those weights mean, right. That's why you want a model that not just generalizes but has meaning. And there, you, I
would [inaudible].>>Okay, so the concern of course is the
generalizability right. So, I can train any model with more degrees of
freedom than data. But, it's not necessarily going to generalize in a useful way. But your point is
well taken too, the more parameters you
have the harder it is to analyze the interior of it. So, I mean, my concern is the
generalizability, all right. How’s this thing going to go with the data it
hasn’t seen before?>>I see, so, yeah, I guess my
response to that is, you know, honestly, cross validation
does not lie. So, I don’t care how
you train the model. If I try it out on
another dataset and it goes well I’m
going to believe it.>>I agree. But, that’s because you put
the regularization, right. Without the regularization, that’s important,
right, without that.>>Oh, yeah, yeah, yeah, using the right algorithm
is important including regularization. I was just pointing out
that I disagree strongly with what I’ll call
conservative statistics that says you can’t do this. In fact, you can and you
must for the kind of models that we’re talking about. [Silence]>>Yeah, I would second
vote for those points. I mean, you’re effectively
making a, you’re putting a prior
assumption which is hidden in the regularization
some way, which you could, in another model
make more explicit. So, of course, you can do it. It’s a question of
interpretation. But, my larger question was, I
kind of lost track as to kind of where, you know, you started
by showing brains and decoding but you seem to end
up with conclusions that I don’t know to
put a point on it. Did I need brain imaging
to get to your conclusions? So, I’m missing what the brain
was used for in these analyses.>>Oh, I see. Well, let's see. So, fundamentally, we're talking about a model where,
a computational model where you give it a stimulus and it predicts the
brain activity, so.>>So, you're using predicting the brain activity as a proxy goal, or as the goal itself. That's to say, the only reason we need the brain is as a thing to predict, not to tell us about the underlying structure of language, because again that could emerge without the brain, right. Is just predicting brain activity a goal in and of itself that's useful?>>I see, right,
right, no, no, no. So, the goal really
is to understand, the long term goal is to
understand language processing. The near-term goal is to understand just the sub-question of how neural representations of meaning are related to the stimulus, like a word, in that the word is the index into your meaning representation, right. And, I think, the main
conclusion is the neural representations, so here
are two alternatives that could’ve been true
before we did this work. It could’ve been that there’s a
big hash function in your brain, every word gets a different code, and there would be a different FMRI image for every word, but there wouldn't be any systematicity there. But, we know that's not true. We know there is systematicity, in fact so much systematicity that if I just code that word with four numbers, which I will informally call shelter, manipulation, eating, and word length, then I can distinguish, 80% of the time, two new
words that have not been in the training set but are
still concrete nouns based on their two FMRI images. So, so it’s not, yeah, so it’s
sort of we are taking the idea that if we want to,
here’s the logic. If we want to understand
the structure of neural representations
of meaning, the best way I can
think of doing it is to build a computational model that makes predictions
extrapolates beyond the stimulus that we’ve been able to give it
for training, makes predictions for arbitrary new words, and then see whether
it’s right or wrong. If it’s wrong, then,
we’ll have to go back and change that model. So, that’s where
the brain comes in.>>So, maybe a more
psychological way of asking the question that Jim
just asked is what did you learn about language that you didn’t
suspect or didn’t know before? That’s one. And then, the second part of
this question is: I was struck by [inaudible] the broad activation of this amount of sensory cortex when you had
associated actions and so on. It seems to me that it would
be very interesting to do, to take a corpus of words that have very strong
visual associations versus very strong [inaudible]
all sorts of other associations and make predictions on a more
general categorical, you know, in spite of those things. So, the first question
is, you know.>>Yeah, yeah, what did you
really learn about language?>>What did you learn
about language? I’m sure you learned some things that surprised you
and what were they?>>Yeah, yeah, that’s great. So, let me answer the second
one first because it’s easy. Yes, we should use
more diverse datasets. And, in fact, we just collected
a thousand word dataset which is a lot easier to do
with MEG than it is with FMRI because your inter-stimulus interval can be much shorter. And so, this model is all built on too small a diversity
of words. So, we're working on that. But, back to your first question, what did we learn about language? It's, you know, sadly, little about how your brain comprehends words. What we learned instead
is how the neural activity in your brain has
substructure that can allow us to predict the spatial
and perhaps temporal, but at least the spatial
distribution of neural activity. So, it’s really, what
we learned is more about meaning representation
than it is about language comprehension. And, I think, I would say
the same of Jim and Jack, if I had to guess, that
they’ve learned more about meaning representation,
or not meaning representation in some case perceptual
representations and what they are and
how they’re structured than about what are
the steps a brain goes through to generate those
patterns of neural activity. It’s certainly true in our case. But my parsing is that it’s
the same for Jack and Jim.>>Before I let them answer
that, which they may want to do, I would say, you know, for once I was very strongly struck by, you know, in your response, that maybe this is exactly the thing: to compare the brains of animals, of monkeys, versus humans, even though it's a language test. You wouldn't present the words but the associated pictures, and then with MEG analysis see if there are meanings that are comprehended in the same way, to see if there is a unique element of human language here that you're getting at. Anyway, it's just
a weird thought.>>Yeah, yeah I’d love
to do this on monkeys.>>I’m wondering about
possible BMI applications. I mean, I think people usually
do BMI with motor cortex output. But, this might be for a lot of
tasks just like a more natural and more distributed
way of getting, you know, intentions out.>>Yeah I think that
that’s probably true. In fact, now that we have
this model, you know, we can set up as a well-defined
algorithmic question: what are the 20 concrete nouns that our model would predict are most easily distinguished based on FMRI. And, if you were going to do some kind of brain computer interface, you might want to know, from an engineering point of view, what are the most separated neural patterns
that you can get. And then, you might even
tag buttons on some device by pictures of those 20 things. And, you know, so I think from
an engineering point of view, that’s not an unreasonable
thing to try.>>What’s kind of intriguing
to me about your approach is that before becoming
familiar with that, I thought about the problem as kind of
having these two levels. One is the computational level
where people build models that process the stimuli
and predict brain activity. And, the other level is
this, so, this semantic level where we have predefined categories, but our models don't really tell us how the brain analyzes the stimuli, right. So, that's as far as the literature on the category selective regions goes. Now, it's very fascinating that I think the different people here are approaching the explanatory gap between these
two from different sides, right. So, it’s sort of interesting
in the context of your work to think of what level it’s
at, because you have this sort of question, like is it furry, right; it is sort of not easy to build a computational model that computes that, right, because if you use a lot of statistical data, you use the text corpus information, perhaps it can get you an approximation of it. But, the way you do that doesn't
really model how the brain does it, right. So, it’s sort of
an intermediate, and intermediate level
where you’re opening up all this complexity of semantic representations
making it more concrete and learning from it, learning
about it from the data. But, there are still some parts, black boxes, so to speak, that are not very close to what the brain is doing, right.>>Right, right, it's
more of a model of what the representations
are not how they get computed in the brain. But again, I think
that’s equally true of the other work we’ve heard
here or not all of the work. But, in particular, I
think closest analogy to what we’re doing is
what Jim was talking about this morning combining
data from multiple people to understand what
substructure there is in the visual cortex neural
patterns, and Jack's work on relating movies to
what structure there is in the representations. So, I think, all three of these
are kind of in the same position in that in the same
awkward position in terms of not telling us how but
just telling us what those representations are.>>I'll take the last question here.>>So, are you going to possibly say anything about the data from, say, when people don't see the words?>>Oh, how many words?>>Well, you said that people are presented with words, and pictures with words. But, you also said something about the heard words, like auditory.>>Auditory.>>Oh, auditory, right, we
have done some MEG work with auditory stimuli and just
some very preliminary data. And. [ Inaudible ] Well, one thing that makes
spoken language difficult is it takes hundreds of milliseconds
to say refrigerator. So, we don’t know quite how
to align these 50 millisecond time windows to the unfolding 300 millisecond utterance of refrigerator. So, that's one of the difficulties. But, we have started to collect MEG data for that. And actually, what we found
mostly was, actually Lela, who’s here somewhere,
there she is, came up with I think
a really clever idea. She said well we don’t know what
the name of all the percepts are but wouldn’t it be
nice to have a test of whether there’s a
perceptual feature here or whether there’s a
semantic feature here. And, she came up with
the following idea. Why don’t we train everywhere in the brain two
different classifiers? One of them will be for a
pair of words like, let’s say, well we’ll do it for
every pair of words, we’ll take say 60 words and
train word pair classifiers for every pair of words. And, we’ll look at the
mean accuracy there. And so then, we’ll get words
like, you know, hammer and house and hammer and screwdriver. We’ll just take the
mean accuracy of all those individual
word pair classifiers. Now, if you think about it, those classifiers
could succeed either by decoding semantic features
or perceptual features. And then, we’ll train a second
set of classifiers that are about word categories, so,
like tools versus buildings. And, those can only generalize
from hammer to screwdriver if they’re actually
using semantic features. So, Lela’s idea was let’s
train both kinds of classifiers and then look for regions
in the brain where one of those does better than
the other and vice versa. And, what we found was, with the
visual line drawing stimuli the place in the brain
where the pair of words classifier did better than the categories was
just occipital cortex. And, in the auditory
presentation it just moved right over to mere auditory
cortex was the only region where he got stronger, substantially stronger
accuracy decoding one pair. So, sorry for the long
winded answer, but, so we’re just starting
to kind of poke around with this
auditory stuff, but. [ Silence ]
