1
00:00:00,040 --> 00:00:02,410
The following content is
provided under a Creative

2
00:00:02,410 --> 00:00:03,790
Commons license.

3
00:00:03,790 --> 00:00:06,030
Your support will help
MIT OpenCourseWare

4
00:00:06,030 --> 00:00:10,110
continue to offer high quality
educational resources for free.

5
00:00:10,110 --> 00:00:12,680
To make a donation, or to
view additional materials

6
00:00:12,680 --> 00:00:16,590
from hundreds of MIT courses,
visit MIT OpenCourseWare

7
00:00:16,590 --> 00:00:17,260
at ocw.mit.edu.

8
00:00:21,570 --> 00:00:23,655
JEREMY KEPNER: All
right, welcome everybody.

9
00:00:23,655 --> 00:00:28,440
I'm really pleased to
see you all here today

10
00:00:28,440 --> 00:00:32,229
for what will be the first of
a nine-lecture course called

11
00:00:32,229 --> 00:00:35,010
Signal Processing on Databases.

12
00:00:35,010 --> 00:00:39,020
And this is a really
special day for me

13
00:00:39,020 --> 00:00:41,685
because this sort of
represents a big step

14
00:00:41,685 --> 00:00:45,310
in a set of technology
that I've been working

15
00:00:45,310 --> 00:00:49,420
on for a couple of
decades, and really

16
00:00:49,420 --> 00:00:53,190
been working intensely on for
the last three or four years.

17
00:00:53,190 --> 00:00:57,150
And this is the
first course on what

18
00:00:57,150 --> 00:00:58,540
we hope will be
a technology that

19
00:00:58,540 --> 00:01:00,890
will become very widely used.

20
00:01:05,030 --> 00:01:08,110
It is a completely unique
course on a completely unique

21
00:01:08,110 --> 00:01:11,110
technology, so I
want to thank you

22
00:01:11,110 --> 00:01:16,980
for being the guinea
pigs of this course.

23
00:01:16,980 --> 00:01:18,800
It's hard enough just
doing a new course

24
00:01:18,800 --> 00:01:20,690
on an existing topic.

25
00:01:20,690 --> 00:01:24,920
Doing a new course on a novel
topic is quite a challenge,

26
00:01:24,920 --> 00:01:27,630
so we're going to be trying
a few different things here.

27
00:01:27,630 --> 00:01:30,700
There are going to be
certain topics covered

28
00:01:30,700 --> 00:01:32,810
from more than one
angle, and we'll

29
00:01:32,810 --> 00:01:37,796
see which ones sort of make
the most sense for you.

30
00:01:37,796 --> 00:01:39,170
The first thing
I'd like to do is

31
00:01:39,170 --> 00:01:41,680
talk a little bit about the
title of the course, which

32
00:01:41,680 --> 00:01:43,480
is a little bit different.

33
00:01:43,480 --> 00:01:46,610
The title is Signal
Processing on Databases,

34
00:01:46,610 --> 00:01:50,470
which joins two phrases, signal
processing and databases, that

35
00:01:50,470 --> 00:01:55,220
are not usually connected
together.

36
00:01:55,220 --> 00:01:59,400
And so, signal processing,
as we all know,

37
00:01:59,400 --> 00:02:05,160
is, here at Lincoln Labs,
one of the most core things

38
00:02:05,160 --> 00:02:06,230
that we do.

39
00:02:06,230 --> 00:02:10,979
In terms of the technology,
it's almost difficult

40
00:02:10,979 --> 00:02:14,450
to imagine a technology that's
more core to what we do.

41
00:02:14,450 --> 00:02:18,280
The only thing I could think of
was systems engineering, maybe.

42
00:02:18,280 --> 00:02:22,430
But signal processing is really
at the heart of what we do.

43
00:02:22,430 --> 00:02:25,410
Mathematically, signal
processing really,

44
00:02:25,410 --> 00:02:27,800
the heart of that
is detection theory.

45
00:02:27,800 --> 00:02:29,700
How do you find things?

46
00:02:29,700 --> 00:02:33,020
Given sets of data,
how do you eliminate

47
00:02:33,020 --> 00:02:35,130
things you're not looking
for and find the things

48
00:02:35,130 --> 00:02:37,010
that you are looking for?

49
00:02:37,010 --> 00:02:40,850
And then, mathematically,
the heart of detection theory

50
00:02:40,850 --> 00:02:44,807
is linear algebra,
and so, just a show

51
00:02:44,807 --> 00:02:46,890
of hands how many people
have had a linear algebra

52
00:02:46,890 --> 00:02:49,530
course?

53
00:02:49,530 --> 00:02:51,650
See, that's really awesome.

54
00:02:51,650 --> 00:02:53,110
See at Lincoln,
that's probably--

55
00:02:53,110 --> 00:02:54,526
this is one of a few
places in the world

56
00:02:54,526 --> 00:02:56,540
where it was pretty unanimous.

57
00:02:56,540 --> 00:02:59,590
You know, there's not too many
places where that's the case.

58
00:02:59,590 --> 00:03:02,860
And so this technology is
really designed for you,

59
00:03:02,860 --> 00:03:07,350
because it really assumes a
foundation in linear algebra,

60
00:03:07,350 --> 00:03:11,180
and that's not as common
as you might think.

61
00:03:11,180 --> 00:03:13,330
And so people who haven't
had linear algebra,

62
00:03:13,330 --> 00:03:16,259
it's much harder for them to
take advantage of the ideas

63
00:03:16,259 --> 00:03:17,550
that we're going to talk about.

64
00:03:17,550 --> 00:03:22,910
So signal processing, detection
theory, linear algebra,

65
00:03:22,910 --> 00:03:26,070
that's sort of the root
of the first phrase.

66
00:03:26,070 --> 00:03:29,910
The second phrase,
databases, when

67
00:03:29,910 --> 00:03:34,970
we think of databases, we
typically think of, maybe

68
00:03:34,970 --> 00:03:41,290
things that you do, searching
on the web, texts, maybe

69
00:03:41,290 --> 00:03:46,310
other things like that,
which is not necessarily

70
00:03:46,310 --> 00:03:49,100
consistent with the mathematics
that we think about when we

71
00:03:49,100 --> 00:03:51,000
think about signal
processing, detection

72
00:03:51,000 --> 00:03:56,030
theory, and linear algebra, where we
naturally go to real numbers,

73
00:03:56,030 --> 00:03:58,510
matrices, those types of things.

74
00:03:58,510 --> 00:04:00,010
So what we're
really doing here is

75
00:04:00,010 --> 00:04:03,500
we're connecting that
mathematics with sort

76
00:04:03,500 --> 00:04:06,160
of a whole new branch of data.

77
00:04:06,160 --> 00:04:10,150
Words and strings
and relationships.

78
00:04:10,150 --> 00:04:12,370
And so that's kind of the
really novel piece of it,

79
00:04:12,370 --> 00:04:14,620
and it's really the core
idea that we're bringing here

80
00:04:14,620 --> 00:04:17,660
is we're going to bring all that
machinery that we know so well,

81
00:04:17,660 --> 00:04:19,700
and we're going to
try and apply it

82
00:04:19,700 --> 00:04:21,820
to this new area which
is becoming increasingly

83
00:04:21,820 --> 00:04:22,870
important for us.

84
00:04:22,870 --> 00:04:24,720
So that's the
explanation of the title.

85
00:04:24,720 --> 00:04:26,770
That's how we picked the title.

86
00:04:26,770 --> 00:04:29,430
There's other
titles, and so I just

87
00:04:29,430 --> 00:04:31,830
wanted to spend a
little time on that.

88
00:04:35,460 --> 00:04:39,240
An enormous number of people
have contributed to this work.

89
00:04:39,240 --> 00:04:43,972
This is a list of the
people that I could remember

90
00:04:43,972 --> 00:04:45,430
at the time that
I typed the slide,

91
00:04:45,430 --> 00:04:48,490
but no doubt I have
forgotten somebody important,

92
00:04:48,490 --> 00:04:50,640
and so I apologize for that.

93
00:04:50,640 --> 00:04:55,200
But certainly these
people, and the whole team

94
00:04:55,200 --> 00:04:59,720
have been very instrumental
in allowing this technology

95
00:04:59,720 --> 00:05:01,340
to move forward here.

96
00:05:07,150 --> 00:05:09,060
So this is going
to be the outline

97
00:05:09,060 --> 00:05:12,670
for this particular lecture.

98
00:05:12,670 --> 00:05:14,450
And there's going to
be a rhythm to most

99
00:05:14,450 --> 00:05:15,950
of these lectures,
which is there's

100
00:05:15,950 --> 00:05:18,960
going to be a lecture, which
is going to be PowerPoint--

101
00:05:18,960 --> 00:05:22,410
I apologize, we all see
plenty of PowerPoint--

102
00:05:22,410 --> 00:05:24,290
mainly to introduce ideas.

103
00:05:24,290 --> 00:05:26,710
And then there will be
sort of a demo, which

104
00:05:26,710 --> 00:05:29,460
will go through code
that is actually

105
00:05:29,460 --> 00:05:34,134
available in your
LLGrid accounts now,

106
00:05:34,134 --> 00:05:35,550
and some of that
code will even be

107
00:05:35,550 --> 00:05:37,610
part of homework
assignments, if you so choose

108
00:05:37,610 --> 00:05:39,180
to do the homework assignments.

109
00:05:39,180 --> 00:05:40,420
And so that's going
to be the flavor.

110
00:05:40,420 --> 00:05:42,350
We'll do the slides,
we'll take a short break,

111
00:05:42,350 --> 00:05:46,995
and then we'll do the
examples and talk about them.

112
00:05:51,140 --> 00:05:53,580
So let me talk about
some of the kinds of--

113
00:05:53,580 --> 00:05:55,740
and I stole this slide--
let me talk about some

114
00:05:55,740 --> 00:05:59,000
of the kinds of data that
the technology that we're

115
00:05:59,000 --> 00:06:06,370
going to be talking about
today really, really enables.

116
00:06:06,370 --> 00:06:08,710
It allows us to
deal with data that

117
00:06:08,710 --> 00:06:14,390
has to do with
relationships, relationships

118
00:06:14,390 --> 00:06:17,560
particularly as those
represented by graphs.

119
00:06:17,560 --> 00:06:22,900
So for example, you might
have vehicle tracks.

120
00:06:22,900 --> 00:06:28,410
Those represent relationships
between locations.

121
00:06:28,410 --> 00:06:31,190
A car goes from one
place to another.

122
00:06:31,190 --> 00:06:35,400
That's a relationship
between those locations.

123
00:06:35,400 --> 00:06:37,950
Another is social networks.

124
00:06:37,950 --> 00:06:40,320
You know, everyone wants to
know who their friends are,

125
00:06:40,320 --> 00:06:43,020
how many friends they have, who
their friends of friends are,

126
00:06:43,020 --> 00:06:44,570
those types of things.

127
00:06:44,570 --> 00:06:47,870
So that's another type
of relationship.

128
00:06:47,870 --> 00:06:52,500
Again, very textual,
very much oriented

129
00:06:52,500 --> 00:06:55,640
towards words and not numbers.

130
00:06:55,640 --> 00:07:01,830
Another area is cyber
networks, a very apropos topic

131
00:07:01,830 --> 00:07:04,720
that people are, at least
everyone is impacted by,

132
00:07:04,720 --> 00:07:08,730
and many people are doing work
on, which is relationships

133
00:07:08,730 --> 00:07:11,670
between computers, communication
between web servers,

134
00:07:11,670 --> 00:07:13,330
those types of things.

135
00:07:13,330 --> 00:07:19,690
And again, that is a very
important area for us.

136
00:07:23,120 --> 00:07:25,390
Just to give you sort
of a big picture here,

137
00:07:25,390 --> 00:07:27,079
like well, why am I even here?

138
00:07:27,079 --> 00:07:28,120
Why do I care about this?

139
00:07:28,120 --> 00:07:30,134
What are the things
that we can do?

140
00:07:30,134 --> 00:07:32,300
I'm going to kind of walk
through an example of some

141
00:07:32,300 --> 00:07:35,060
of the things that we have
used this technology for that

142
00:07:35,060 --> 00:07:37,000
have been very exciting for us.

143
00:07:37,000 --> 00:07:40,430
And so, trying-- and
we'll have examples

144
00:07:40,430 --> 00:07:42,610
like that throughout
the course, you

145
00:07:42,610 --> 00:07:45,160
know, how you can
actually use this,

146
00:07:45,160 --> 00:07:47,200
how this has had
really a real impact.

147
00:07:47,200 --> 00:07:49,150
And so the first one
we want to talk about

148
00:07:49,150 --> 00:07:53,860
is some work we've done
in the cyber domain.

149
00:07:53,860 --> 00:07:55,930
So here's an example
of a data set,

150
00:07:55,930 --> 00:08:00,610
which is a graph
that is web traffic.

151
00:08:00,610 --> 00:08:05,540
And so, if we see here,
this is the source IPs.

152
00:08:05,540 --> 00:08:08,510
So, you can imagine these
are the computers that

153
00:08:08,510 --> 00:08:11,170
are on the inside
of some domain,

154
00:08:11,170 --> 00:08:13,135
and then these are
the destination IPs,

155
00:08:13,135 --> 00:08:17,780
the web servers that they hit,
the green ones around here.

156
00:08:17,780 --> 00:08:20,090
This is sometimes referred
to in graph theory

157
00:08:20,090 --> 00:08:21,430
as a bipartite graph.

158
00:08:21,430 --> 00:08:23,630
It's basically
two separate sets,

159
00:08:23,630 --> 00:08:26,180
of vertices that
talk to each other

160
00:08:26,180 --> 00:08:28,860
but, at least
in this data set,

161
00:08:28,860 --> 00:08:33,250
we are not recording any
connections within the sets.

162
00:08:33,250 --> 00:08:36,750
And then, hopefully you can
see here these faint blue lines

163
00:08:36,750 --> 00:08:40,370
here that are the actual
connections that show you

164
00:08:40,370 --> 00:08:44,360
which of these
computers are connecting

165
00:08:44,360 --> 00:08:46,330
to which of these computers.

166
00:08:46,330 --> 00:08:49,240
There's obviously a lot
of data, web traffic data.

167
00:08:49,240 --> 00:08:50,680
There's an enormous amount of it.

168
00:08:50,680 --> 00:08:57,030
This shows 90 minutes of data,
and in this particular case

169
00:08:57,030 --> 00:09:00,010
there was actually
malicious activity

170
00:09:00,010 --> 00:09:03,160
that we were looking to
detect in this data set.

171
00:09:06,230 --> 00:09:10,060
So one of the biggest challenges
that you have in any data

172
00:09:10,060 --> 00:09:13,400
set like this is
data representation.

173
00:09:13,400 --> 00:09:15,680
And sometimes in
database parlance

174
00:09:15,680 --> 00:09:17,660
this is called the schema.

175
00:09:17,660 --> 00:09:20,784
So how you represent
your data is usually

176
00:09:20,784 --> 00:09:22,200
if you can figure
that out, you're

177
00:09:22,200 --> 00:09:26,720
almost halfway to
solving the problem.

178
00:09:26,720 --> 00:09:30,120
And so, given
that, in this case,

179
00:09:30,120 --> 00:09:32,260
we wanted to use
graph techniques,

180
00:09:32,260 --> 00:09:34,110
we discover that our
data almost always

181
00:09:34,110 --> 00:09:37,340
doesn't come to us in
a nicely formed graph,

182
00:09:37,340 --> 00:09:40,840
and so we have to go
through a lot of steps here.

183
00:09:40,840 --> 00:09:43,640
And so one thing you'll
see with this technology

184
00:09:43,640 --> 00:09:47,020
is that there's almost always
a pipeline

185
00:09:47,020 --> 00:09:52,720
from raw data to conditioning
the raw data to inserting it

186
00:09:52,720 --> 00:09:57,207
into a database to doing
analytics to doing queries

187
00:09:57,207 --> 00:09:58,290
and other types of things.

188
00:09:58,290 --> 00:10:00,864
You're always really going
to be using a pipeline.

189
00:10:00,864 --> 00:10:03,280
And you're going to be wanting
to use different technology

190
00:10:03,280 --> 00:10:05,807
tools for different
stages in the pipeline.

191
00:10:05,807 --> 00:10:07,390
We're not in any way
trying to suggest

192
00:10:07,390 --> 00:10:09,348
that the technologies
we're talking about today

193
00:10:09,348 --> 00:10:10,960
are the only pieces.

194
00:10:10,960 --> 00:10:12,710
It's an important
piece, the technologies

195
00:10:12,710 --> 00:10:15,030
we're talking about, but
they're not the only pieces.

196
00:10:15,030 --> 00:10:16,780
And you're always
really functioning

197
00:10:16,780 --> 00:10:18,080
in the context of a pipeline.

198
00:10:22,580 --> 00:10:24,620
So this just shows
the technology

199
00:10:24,620 --> 00:10:26,410
stack that we will
be leveraging,

200
00:10:26,410 --> 00:10:29,490
that we typically leverage
in these types of things.

201
00:10:29,490 --> 00:10:31,460
Ultimately, our goal
is to get up here

202
00:10:31,460 --> 00:10:33,092
to these graph analytics.

203
00:10:35,366 --> 00:10:37,240
One of the things we'll
be talking about most

204
00:10:37,240 --> 00:10:39,780
in this course is the
high level languages

205
00:10:39,780 --> 00:10:44,420
that allow you to essentially
do graph analytics very easily

206
00:10:44,420 --> 00:10:48,700
and bridge between the
databases themselves,

207
00:10:48,700 --> 00:10:51,460
and the high performance
computing that you will need.

208
00:10:51,460 --> 00:10:55,820
So again, not only do you have
a pipeline of software and data,

209
00:10:55,820 --> 00:10:58,640
you have a pipeline or
stack of technology.

210
00:10:58,640 --> 00:11:00,730
So when you're addressing
problems like this,

211
00:11:00,730 --> 00:11:02,510
usually you're
starting at one place

212
00:11:02,510 --> 00:11:05,110
but you will almost always,
when you build a system,

213
00:11:05,110 --> 00:11:07,700
work out and deal with
these other types of things.

214
00:11:11,040 --> 00:11:13,350
So that's sort of just a
quick overview and example

215
00:11:13,350 --> 00:11:16,299
and I don't expect any of you--
it's probably just kind of,

216
00:11:16,299 --> 00:11:17,840
the purpose of the
example's probably

217
00:11:17,840 --> 00:11:19,030
to raise more questions.

218
00:11:19,030 --> 00:11:21,261
Well, how do you really do this?

219
00:11:21,261 --> 00:11:23,510
So now I'm going to sort of
take a bit of a right turn

220
00:11:23,510 --> 00:11:26,680
here and really talk about
what the course is about, OK.

221
00:11:26,680 --> 00:11:28,860
And I'm going to actually
talk about the course

222
00:11:28,860 --> 00:11:30,980
a little bit of a big picture.

223
00:11:30,980 --> 00:11:33,710
And I always felt that
every course should tell you

224
00:11:33,710 --> 00:11:37,810
why that course is the most
important course you'll ever

225
00:11:37,810 --> 00:11:39,149
take in your career.

226
00:11:39,149 --> 00:11:40,690
You know, I think
every single course

227
00:11:40,690 --> 00:11:42,120
you can come up with
an argument for that,

228
00:11:42,120 --> 00:11:43,120
but people don't say it.

229
00:11:43,120 --> 00:11:45,525
So I'm going to tell
you why this course is

230
00:11:45,525 --> 00:11:47,400
the most important course
you will ever take.

231
00:11:47,400 --> 00:11:49,210
Of course, every course is
the most important course

232
00:11:49,210 --> 00:11:52,060
you'll ever take, but I'm going
to tell you why this one is.

233
00:11:52,060 --> 00:11:54,980
So stepping way back,
we're all part of MIT here

234
00:11:54,980 --> 00:11:58,900
and MIT has a
formula for success.

235
00:11:58,900 --> 00:12:01,030
MIT is the number one
science and engineering

236
00:12:01,030 --> 00:12:03,830
university on earth.

237
00:12:03,830 --> 00:12:07,150
Actually, by
any reasonable measure,

238
00:12:07,150 --> 00:12:09,560
that is true.

239
00:12:09,560 --> 00:12:15,470
And in fact, by the people who
track this, it's true by a lot.

240
00:12:15,470 --> 00:12:17,530
And so you know,
it's a real honor

241
00:12:17,530 --> 00:12:19,540
to be associated with
this organization,

242
00:12:19,540 --> 00:12:22,850
to be at an organization
that's at its zenith.

243
00:12:22,850 --> 00:12:27,840
And MIT has had this incredible
run of success over the last 60

244
00:12:27,840 --> 00:12:31,570
years because fundamentally
they recognize the formula

245
00:12:31,570 --> 00:12:34,090
for making discoveries is
a combination of theory

246
00:12:34,090 --> 00:12:35,920
and experiment.

247
00:12:35,920 --> 00:12:39,800
Bring those two together,
and as an organization,

248
00:12:39,800 --> 00:12:43,030
we've actually
organized ourselves,

249
00:12:43,030 --> 00:12:47,240
MIT is organized around this
formula in that we have theory

250
00:12:47,240 --> 00:12:50,240
and experiment, and theory
is the academic part

251
00:12:50,240 --> 00:12:53,120
of the mission.

252
00:12:53,120 --> 00:12:56,640
That's run out of departments.

253
00:12:56,640 --> 00:12:58,546
Typically, it
involves mathematics.

254
00:12:58,546 --> 00:13:00,170
This is a place-- I
mean, everyone here

255
00:13:00,170 --> 00:13:02,110
raised their hands
to linear algebra,

256
00:13:02,110 --> 00:13:06,060
and I can go to any part of
MIT and that will be true.

257
00:13:06,060 --> 00:13:08,700
We probably have the highest
mathematical literacy

258
00:13:08,700 --> 00:13:13,310
of any organization on earth,
and that's a very powerful,

259
00:13:13,310 --> 00:13:15,500
powerful capability.

260
00:13:15,500 --> 00:13:20,370
The mathematics that we use,
they become manifest or real,

261
00:13:20,370 --> 00:13:21,970
typically in the
forms of algorithms,

262
00:13:21,970 --> 00:13:26,130
and our algorithms become
real in the form of software.

263
00:13:26,130 --> 00:13:28,540
Likewise, on the
experimental side, you know,

264
00:13:28,540 --> 00:13:32,220
that's the research
that we conduct.

265
00:13:32,220 --> 00:13:38,130
MIT actually has
large laboratories.

266
00:13:38,130 --> 00:13:41,550
Laboratories that are fully
peered with the departments.

267
00:13:41,550 --> 00:13:44,120
That is actually
quite a unique thing

268
00:13:44,120 --> 00:13:46,680
for an academic institution
to have laboratories

269
00:13:46,680 --> 00:13:49,060
and laboratory heads
that are right up

270
00:13:49,060 --> 00:13:52,690
there with the department heads
in how they're positioned.

271
00:13:52,690 --> 00:13:56,470
Lincoln being the largest of
them, CSAIL, the Media Lab,

272
00:13:56,470 --> 00:13:58,170
there's a number of them.

273
00:13:58,170 --> 00:13:59,610
So we have this thing.

274
00:13:59,610 --> 00:14:01,270
And the focus of
research, of course,

275
00:14:01,270 --> 00:14:04,870
is measurement,
which is data, which

276
00:14:04,870 --> 00:14:07,230
often gets reduced to bytes.

277
00:14:07,230 --> 00:14:12,530
And so, often implementing
this MIT formula

278
00:14:12,530 --> 00:14:16,990
comes down to marrying
software and bytes together

279
00:14:16,990 --> 00:14:17,750
in a computer.

280
00:14:17,750 --> 00:14:21,590
That's sort of where we bring
these two ideas together.

281
00:14:21,590 --> 00:14:24,040
And so we're going to be
talking a lot about that.

282
00:14:24,040 --> 00:14:28,040
Implementing software,
bringing bytes together,

283
00:14:28,040 --> 00:14:30,950
analyzing them together: that's where
these two things come together,

284
00:14:30,950 --> 00:14:33,950
and I think that's
really, really important.

285
00:14:33,950 --> 00:14:35,270
Now what's the tool we use?

286
00:14:35,270 --> 00:14:37,680
What's the machine that we
use to bring these together?

287
00:14:37,680 --> 00:14:39,830
It's a computer, all right.

288
00:14:39,830 --> 00:14:42,130
And nowadays, computers
have gotten a little bit

289
00:14:42,130 --> 00:14:45,269
more complicated than the way
we used to think about them.

290
00:14:45,269 --> 00:14:46,810
We used to think
about them typically

291
00:14:46,810 --> 00:14:48,490
as a von Neumann
machine, which means

292
00:14:48,490 --> 00:14:52,420
you had data and operations,
you know, bytes and software.

293
00:14:52,420 --> 00:14:55,480
And the bytes go into
a part of the computer,

294
00:14:55,480 --> 00:14:57,310
they get operated
on, they come out.

295
00:14:57,310 --> 00:14:59,970
And that model is still
true at a high level,

296
00:14:59,970 --> 00:15:02,140
but the computers we
tend to deal with now

297
00:15:02,140 --> 00:15:04,400
are much, much, much
more complicated.

298
00:15:04,400 --> 00:15:08,020
So on the left is a
standard parallel computer.

299
00:15:08,020 --> 00:15:11,060
Almost every computer used
today looks like this.

300
00:15:11,060 --> 00:15:12,890
You have processors,
you have memory,

301
00:15:12,890 --> 00:15:16,377
you have some kind of
persistent storage,

302
00:15:16,377 --> 00:15:18,210
you have a network
connecting them together,

303
00:15:18,210 --> 00:15:22,200
and these form what we call a
very complicated memory hierarchy.

304
00:15:22,200 --> 00:15:27,590
It begins with registers
here at the top, caches,

305
00:15:27,590 --> 00:15:29,700
local memory, remote
memory, and then storage.

306
00:15:29,700 --> 00:15:32,410
And these are the units
that go between them.

307
00:15:32,410 --> 00:15:35,970
And there's a lot of
implications of this hierarchy,

308
00:15:35,970 --> 00:15:37,850
in terms of what
things are-- you know,

309
00:15:37,850 --> 00:15:42,230
bandwidth goes up as you go
this way, latency goes down.

310
00:15:42,230 --> 00:15:44,830
Programmability generally
goes up the closer you are.

311
00:15:44,830 --> 00:15:46,550
You don't have to
worry about things.

312
00:15:46,550 --> 00:15:48,120
And likewise, the
capacity of data

313
00:15:48,120 --> 00:15:52,200
goes down as you traverse
this way in a hierarchy.

314
00:15:52,200 --> 00:15:55,110
So all modern
computers that you use

315
00:15:55,110 --> 00:15:56,670
are these von
Neumann architectures

316
00:15:56,670 --> 00:15:58,727
with multi-level
memory hierarchy.

317
00:15:58,727 --> 00:16:00,560
And fundamentally the
thing you have to know

318
00:16:00,560 --> 00:16:05,596
is that this architecture
selects the algorithms you use.

319
00:16:05,596 --> 00:16:08,620
If algorithms are not going to
run well in this architecture,

320
00:16:08,620 --> 00:16:11,730
you probably don't
implement them.

321
00:16:11,730 --> 00:16:15,130
So one has to be very much
aware of one's algorithms

322
00:16:15,130 --> 00:16:16,890
with respect to
this architecture,

323
00:16:16,890 --> 00:16:21,200
because if you pick algorithms
that aren't going to run well

324
00:16:21,200 --> 00:16:27,280
in this architecture, it's going
to be very, very difficult.

325
00:16:27,280 --> 00:16:32,490
The goal of this
course is to teach you

326
00:16:32,490 --> 00:16:37,720
techniques that will allow you
to get your work done faster.

327
00:16:37,720 --> 00:16:40,910
And so this is a rather
complicated drawing here,

328
00:16:40,910 --> 00:16:46,940
but this shows, given a
problem that you want to solve,

329
00:16:46,940 --> 00:16:49,290
if you implement it
in software using

330
00:16:49,290 --> 00:16:54,310
a variety of techniques,
what the performance

331
00:16:54,310 --> 00:16:55,270
of those techniques is.

332
00:16:55,270 --> 00:16:57,980
So in a certain sense, you could
view this as the volume of code

333
00:16:57,980 --> 00:17:00,075
that you have to write
to solve a problem.

334
00:17:00,075 --> 00:17:02,320
And then we have implementing
the program in C

335
00:17:02,320 --> 00:17:04,089
as a reference point.

336
00:17:04,089 --> 00:17:06,099
So as you move over
here and you do

337
00:17:06,099 --> 00:17:10,380
things in assembly
or Java MapReduce,

338
00:17:10,380 --> 00:17:12,420
you end up writing more code.

339
00:17:12,420 --> 00:17:15,770
If you move over here and
do things in C++ or Java

340
00:17:15,770 --> 00:17:19,010
or MATLAB, you write less code,
and this shows you the relative

341
00:17:19,010 --> 00:17:20,359
performance.

342
00:17:20,359 --> 00:17:22,710
Again, as you do
things in lower level

343
00:17:22,710 --> 00:17:25,829
environments like assembly,
you can get more performance.

344
00:17:25,829 --> 00:17:28,860
And as you do things in higher
level environments like MATLAB,

345
00:17:28,860 --> 00:17:30,700
you get less performance.

346
00:17:30,700 --> 00:17:34,150
And this is a fairly general
trade off space here.

347
00:17:34,150 --> 00:17:36,850
If we add parallel
programming models here,

348
00:17:36,850 --> 00:17:39,280
of which there are many, we
have direct memory access,

349
00:17:39,280 --> 00:17:44,040
we have message passing, we
have a manager worker style

350
00:17:44,040 --> 00:17:47,700
or MapReduce style, we have
a whole set of different styles

351
00:17:47,700 --> 00:17:50,150
of programming here.

352
00:17:50,150 --> 00:17:52,160
We're going to want
to try and be here.

353
00:17:52,160 --> 00:17:54,570
We're going to want to do
things that take less effort

354
00:17:54,570 --> 00:17:56,520
and deliver you
more performance.

355
00:17:56,520 --> 00:17:59,270
And so, we are going
to be mostly focusing

356
00:17:59,270 --> 00:18:01,940
on MATLAB in this
class and a technology

357
00:18:01,940 --> 00:18:05,770
we call [INAUDIBLE]; I'll
describe that more later.

358
00:18:05,770 --> 00:18:08,950
But this is a
software combination

359
00:18:08,950 --> 00:18:14,110
that allows you to get
relatively good performance

360
00:18:14,110 --> 00:18:16,450
at a significantly
reduced effort,

361
00:18:16,450 --> 00:18:18,960
and that's really the only thing
we're doing in this course.

362
00:18:18,960 --> 00:18:22,230
We're just saying, if you're
doing these types of problems,

363
00:18:22,230 --> 00:18:24,040
there's a right tool
for the job or there

364
00:18:24,040 --> 00:18:25,720
are right tools for
the job, and if you

365
00:18:25,720 --> 00:18:27,840
use the right tool for
the job, it will take you

366
00:18:27,840 --> 00:18:29,880
less time to get the job done.

367
00:18:29,880 --> 00:18:33,210
And we're going to try and
teach you about these tools

368
00:18:33,210 --> 00:18:36,480
and how they actually work.

369
00:18:36,480 --> 00:18:39,110
Now, a little bit of
bait and switch here.

370
00:18:39,110 --> 00:18:41,630
We call the course Signal
Processing on Databases,

371
00:18:41,630 --> 00:18:44,980
but we're going to get to
databases at the very end,

372
00:18:44,980 --> 00:18:48,760
unfortunately, to
disappoint you all.

373
00:18:48,760 --> 00:18:53,090
And the reason is
because databases

374
00:18:53,090 --> 00:18:54,700
are good for one
type of problem,

375
00:18:54,700 --> 00:18:57,910
but it's often at the
very end of the game

376
00:18:57,910 --> 00:19:00,380
where you really need
to work with databases.

377
00:19:00,380 --> 00:19:03,290
So to understand, we have
a variety of different ways

378
00:19:03,290 --> 00:19:07,320
we can solve problems
with algorithms, OK.

379
00:19:07,320 --> 00:19:11,440
We can solve our problems
just using a single computer's

380
00:19:11,440 --> 00:19:12,600
memory.

381
00:19:12,600 --> 00:19:17,160
We can solve our
problems using the memory

382
00:19:17,160 --> 00:19:18,450
on multiple computers.

383
00:19:18,450 --> 00:19:23,920
We can solve our problems
using disks or storage, or even

384
00:19:23,920 --> 00:19:27,030
with many disks, OK,
and as we go this way,

385
00:19:27,030 --> 00:19:29,780
we can do more and
more and more data.

386
00:19:29,780 --> 00:19:32,360
But one of the things that
really matters, and it kind of

387
00:19:32,360 --> 00:19:35,690
goes back to that other
chart there is well,

388
00:19:35,690 --> 00:19:39,770
when I'm working with this
data, what is the chunk size?

389
00:19:39,770 --> 00:19:43,990
What are the blocks that
I need to get the data in?

390
00:19:43,990 --> 00:19:51,060
And if I need to get the data
in very, very small chunks,

391
00:19:51,060 --> 00:19:53,140
well then obviously
having the data in memory

392
00:19:53,140 --> 00:19:55,230
is going to be the most
efficient thing I can do.

393
00:19:55,230 --> 00:20:00,190
If I want to just get one
record out of many records,

394
00:20:00,190 --> 00:20:05,370
then that's the most efficient,
having the data in memory.
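This chunk-size trade-off can be sketched with a toy example. This is only an illustration of the two access patterns, not anything from the course materials; the records below are invented.

```python
# Rough sketch of the chunk-size point: when the chunks you need are tiny
# (one record out of many), an in-memory index answers a lookup directly,
# while storage-style access without an index means scanning everything.
# The records here are made up for illustration.

records = [{"id": i, "value": i * i} for i in range(1000)]

# In-memory model: build an index once; each lookup touches one record.
index = {r["id"]: r for r in records}
print(index[42]["value"])  # -> 1764

# Scan model: without an index, finding one record means walking the
# whole collection, which is what a database's indexes exist to avoid.
def scan(rows, wanted_id):
    for r in rows:
        if r["id"] == wanted_id:
            return r
    return None

print(scan(records, 42)["value"])  # -> 1764
```

Both calls return the same record; the difference is how much data each has to touch to find it, which is exactly why small-chunk access favors memory (or a database) over flat files.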

395
00:20:05,370 --> 00:20:10,420
Likewise-- so that's where
we typically start, right?

396
00:20:10,420 --> 00:20:13,600
If our problem is small, and
we almost always try and start

397
00:20:13,600 --> 00:20:17,370
with our problems small,
we're going to be working here.

398
00:20:17,370 --> 00:20:21,880
We're going to be writing what
we call serial programs on one

399
00:20:21,880 --> 00:20:23,200
computer.

400
00:20:23,200 --> 00:20:25,270
And then, the great
thing about serial memory

401
00:20:25,270 --> 00:20:31,030
is that even as our
request size gets larger,

402
00:20:31,030 --> 00:20:34,390
they're really not going to
do anything better than just

403
00:20:34,390 --> 00:20:35,670
having the data in memory.

404
00:20:35,670 --> 00:20:39,150
So we'll be able to grow
that model up to data sets

405
00:20:39,150 --> 00:20:44,450
that are fairly small, maybe
gigabytes, and we can continue

406
00:20:44,450 --> 00:20:46,870
to use that model just fine.

407
00:20:46,870 --> 00:20:49,390
And then, if we
decide, well, we're

408
00:20:49,390 --> 00:20:54,220
really moving beyond that,
we need to do more data,

409
00:20:54,220 --> 00:20:58,880
we can go to a parallel
program to get more memory,

410
00:20:58,880 --> 00:21:01,940
or we can use data in files.

411
00:21:01,940 --> 00:21:03,580
We can write our
data to files, we

412
00:21:03,580 --> 00:21:05,663
can read them in, process
them, and write them out,

413
00:21:05,663 --> 00:21:07,640
and that model works
very, very well.

414
00:21:07,640 --> 00:21:10,440
And in fact,
particularly if we're

415
00:21:10,440 --> 00:21:12,874
going to be accessing
the data in large chunks.

416
00:21:12,874 --> 00:21:14,290
If I have a lot
of data and I

417
00:21:14,290 --> 00:21:18,030
want to do some processing
on the whole piece of it.

418
00:21:21,170 --> 00:21:23,680
Likewise, if we go to
very large data sets,

419
00:21:23,680 --> 00:21:27,240
we find that we can even do
parallel programs writing files

420
00:21:27,240 --> 00:21:30,520
in parallel if we want to do
extremely large problems where we

421
00:21:30,520 --> 00:21:33,820
want to traverse all the data.

422
00:21:33,820 --> 00:21:37,820
But finally, we're going
to come back to this case

423
00:21:37,820 --> 00:21:39,680
where we need a database.

424
00:21:39,680 --> 00:21:42,350
And a database is when we have
very large amounts of data

425
00:21:42,350 --> 00:21:44,900
but we want to access--

426
00:21:44,900 --> 00:21:47,250
data that won't
fit in our memory,

427
00:21:47,250 --> 00:21:49,160
but we want to access
in little bits.

428
00:21:49,160 --> 00:21:50,770
We want to look up something.

429
00:21:50,770 --> 00:21:54,720
And that's really the only
time we really need a database,

430
00:21:54,720 --> 00:21:58,520
unless of course someone
says this is your data.

431
00:21:58,520 --> 00:22:00,260
It is available to
you on a database.

432
00:22:00,260 --> 00:22:02,790
That's the only way
you're going to access it.

433
00:22:02,790 --> 00:22:05,270
But if you have
control-- so you always

434
00:22:05,270 --> 00:22:06,650
want to follow this path.

435
00:22:06,650 --> 00:22:09,100
You want to start with
your serial program memory,

436
00:22:09,100 --> 00:22:12,570
maybe go to a parallel
program using files, maybe

437
00:22:12,570 --> 00:22:15,590
parallel files, and then
the database is almost

438
00:22:15,590 --> 00:22:17,850
the tool of last
resort because it

439
00:22:17,850 --> 00:22:22,430
adds a great deal of complexity,
and while it can look up

440
00:22:22,430 --> 00:22:25,712
things much quicker, if you
actually end up doing something

441
00:22:25,712 --> 00:22:27,170
where you're going
to be traversing

442
00:22:27,170 --> 00:22:28,628
a significant
fraction of the data,

443
00:22:28,628 --> 00:22:31,280
you will find it's actually
significantly slower

444
00:22:31,280 --> 00:22:33,370
than if you have
the data on files.

445
00:22:33,370 --> 00:22:35,610
So we're going to
show you techniques,

446
00:22:35,610 --> 00:22:38,980
we're going to follow this path,
and then hopefully by the end,

447
00:22:38,980 --> 00:22:40,970
you'll be like,
oh, well yes, now I

448
00:22:40,970 --> 00:22:43,070
know why we use a database
for these instances,

449
00:22:43,070 --> 00:22:44,510
but I realize that
I can get most

450
00:22:44,510 --> 00:22:46,850
of my work done using
these other techniques that

451
00:22:46,850 --> 00:22:49,670
work just fine.

452
00:22:49,670 --> 00:22:51,570
So hopefully you're
going to be learning

453
00:22:51,570 --> 00:22:53,840
what we call the fast path.

454
00:22:53,840 --> 00:22:58,840
So if you want to get a
certain level of performance,

455
00:22:58,840 --> 00:23:05,260
OK, if you have no training
or minimal training,

456
00:23:05,260 --> 00:23:07,580
this is the path
that you follow.

457
00:23:07,580 --> 00:23:10,830
You begin using a technology.

458
00:23:10,830 --> 00:23:13,400
Your program gets
slower for a while.

459
00:23:13,400 --> 00:23:16,740
You beat your head against the
wall and the web for a while.

460
00:23:16,740 --> 00:23:18,540
Maybe eventually,
after days, you

461
00:23:18,540 --> 00:23:21,260
get back to where you
were, and then slowly you

462
00:23:21,260 --> 00:23:24,881
climb up here to get it to
some level of performance weeks

463
00:23:24,881 --> 00:23:25,380
later.

464
00:23:25,380 --> 00:23:29,400
This is something we
observe time and time again,

465
00:23:29,400 --> 00:23:33,210
and typically what happens
is that right around here

466
00:23:33,210 --> 00:23:35,035
people just give up
on the technology

467
00:23:35,035 --> 00:23:38,830
and they say the technology
didn't work for them.

468
00:23:38,830 --> 00:23:41,490
The point of this class is
to give you the techniques

469
00:23:41,490 --> 00:23:43,450
that an expert would have.

470
00:23:43,450 --> 00:23:45,600
An expert could
use the technology,

471
00:23:45,600 --> 00:23:49,670
know exactly how to use it,
and then can go on this path

472
00:23:49,670 --> 00:23:52,170
where basically within hours
you're getting good performance

473
00:23:52,170 --> 00:23:54,340
and if they really want to
take it to the next level

474
00:23:54,340 --> 00:23:56,260
they can spend days doing that.

475
00:23:56,260 --> 00:23:59,900
And so that's kind of our
whole philosophy here.

476
00:23:59,900 --> 00:24:03,280
So we're going to actually
have some new concepts

477
00:24:03,280 --> 00:24:04,150
in this course.

478
00:24:04,150 --> 00:24:09,960
I think they're talked
about in the course

479
00:24:09,960 --> 00:24:14,410
outline a little bit, some
of the really new ideas

480
00:24:14,410 --> 00:24:20,170
that you probably wouldn't have
run across in other courses.

481
00:24:20,170 --> 00:24:22,700
So we're going to be
talking about graphs a lot

482
00:24:22,700 --> 00:24:25,360
in this course.

483
00:24:25,360 --> 00:24:28,160
The standard graph
that-- how many people

484
00:24:28,160 --> 00:24:32,160
have taken a graph theory course
or a computer science course

485
00:24:32,160 --> 00:24:34,690
that talked about graphs?

486
00:24:34,690 --> 00:24:35,460
Most.

487
00:24:35,460 --> 00:24:40,570
I'm glad to see linear algebra
is still the stronger one here.

488
00:24:40,570 --> 00:24:42,962
That's more important.

489
00:24:42,962 --> 00:24:45,170
Between the two, having the
linear algebra background

490
00:24:45,170 --> 00:24:46,957
is the more important.

491
00:24:46,957 --> 00:24:48,540
And I can, if you
know linear algebra,

492
00:24:48,540 --> 00:24:52,060
I can teach you graph
theory pretty easily.

493
00:24:52,060 --> 00:24:54,170
The traditional
definitions of graphs,

494
00:24:54,170 --> 00:24:56,660
though, that are taught in
most computer science courses,

495
00:24:56,660 --> 00:25:02,990
they tend to focus
on what we call

496
00:25:02,990 --> 00:25:06,850
undirected, unweighted graphs.

497
00:25:06,850 --> 00:25:10,381
So these are-- a graph is a
set of vertices and edges,

498
00:25:10,381 --> 00:25:11,880
and so what we're
saying when we say

499
00:25:11,880 --> 00:25:13,810
it's an undirected
and unweighted graph

500
00:25:13,810 --> 00:25:17,584
is that all the edges are the
same between any two vertices.

501
00:25:17,584 --> 00:25:19,250
There's no real
difference between them.

502
00:25:19,250 --> 00:25:21,430
Either an edge
exists or it doesn't.

503
00:25:21,430 --> 00:25:23,235
It's kind of like
a zero or a one.

504
00:25:23,235 --> 00:25:24,610
And then it doesn't
really-- when

505
00:25:24,610 --> 00:25:26,890
we say two vertices
connected, there's

506
00:25:26,890 --> 00:25:29,880
no directionality of that.

507
00:25:29,880 --> 00:25:34,010
The majority of graph theory
and what's taught in class

508
00:25:34,010 --> 00:25:36,670
is based on that.

509
00:25:36,670 --> 00:25:37,260
Excuse me.

510
00:25:37,260 --> 00:25:49,652
Unfortunately, when we
get into the real world--

511
00:25:49,652 --> 00:25:51,610
unfortunately, when we
get into the real world,

512
00:25:51,610 --> 00:25:54,930
we find that I've never
run into that kind of graph

513
00:25:54,930 --> 00:25:56,214
in the real world.

514
00:25:56,214 --> 00:25:57,630
In the years I've
been doing this,

515
00:25:57,630 --> 00:26:01,904
I've never actually run
into that kind of graph.

516
00:26:01,904 --> 00:26:03,570
In fact, another part
of that is they'll

517
00:26:03,570 --> 00:26:04,930
talk about the distribution.

518
00:26:04,930 --> 00:26:06,700
In fact, you'll
often hear something

519
00:26:06,700 --> 00:26:08,770
called an Erdos-Renyi
graph, which

520
00:26:08,770 --> 00:26:10,950
is just a random connection.

521
00:26:10,950 --> 00:26:14,170
It's where any two vertices
are randomly connected

522
00:26:14,170 --> 00:26:15,520
with maybe some probability.

523
00:26:18,410 --> 00:26:21,640
And again, I've never run into
that either in the real world.

524
00:26:21,640 --> 00:26:24,270
So the focus of graph
theory that we have now

525
00:26:24,270 --> 00:26:26,100
that we have all this
data and can compare

526
00:26:26,100 --> 00:26:27,940
the theory and
the data, can compare

527
00:26:27,940 --> 00:26:31,810
the theory with the experiment, we
see that the theory is wanting.

528
00:26:31,810 --> 00:26:36,830
The theory is not a good
stepping stone into the data.

529
00:26:36,830 --> 00:26:42,690
And so what we see is that
the real data is not random.

530
00:26:42,690 --> 00:26:44,900
It's often what we call
a power law distribution,

531
00:26:44,900 --> 00:26:47,350
that certain vertices
are very popular

532
00:26:47,350 --> 00:26:52,420
and most vertices don't have
a lot of connections to them.
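To make the distinction concrete, here is a small Python sketch (the course's own tooling is D4M in MATLAB; the vertex and edge counts here are invented for illustration). It compares an Erdos-Renyi-style edge list, where endpoints are chosen uniformly at random, with a preferential-attachment one, where already-popular vertices keep attracting edges:

```python
import random
from collections import Counter

random.seed(0)
N, E = 1000, 3000

# Erdos-Renyi style: both endpoints of every edge chosen uniformly.
er_edges = [(random.randrange(N), random.randrange(N)) for _ in range(E)]

# Power-law style via preferential attachment: endpoints are drawn
# in proportion to the connections they already have, so a few
# vertices become very popular hubs.
endpoints = [0, 1]
pa_edges = []
for v in range(2, 2 + E):
    u = random.choice(endpoints)
    pa_edges.append((v, u))
    endpoints += [v, u]

def max_degree(edges):
    deg = Counter()
    for a, b in edges:
        deg[a] += 1
        deg[b] += 1
    return max(deg.values())

# The uniformly random graph has no hubs; the preferential one does.
print(max_degree(er_edges) < max_degree(pa_edges))  # → True
```

Plotted on log-log axes, the preferential-attachment degree counts fall on a rough straight line, which is what "power law" refers to.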

533
00:26:52,420 --> 00:26:55,780
Usually the edges are directed,
that is, there's meaning.

534
00:26:58,410 --> 00:27:00,240
If I friend you,
it's not the same

535
00:27:00,240 --> 00:27:05,420
as you friending me on Facebook
or Twitter, or you twit face,

536
00:27:05,420 --> 00:27:10,570
or whatever the technology
is of the future.

537
00:27:10,570 --> 00:27:13,240
Usually the edge itself
has meaning, has weight.

538
00:27:13,240 --> 00:27:14,650
That weight could be a value.

539
00:27:14,650 --> 00:27:17,086
It could be words, it could
be other types of things.

540
00:27:20,830 --> 00:27:22,390
Typically you have
multiple edges.

541
00:27:22,390 --> 00:27:24,402
I could send you
many friend requests.

542
00:27:24,402 --> 00:27:26,610
You know, that's not the
same thing as me sending you

543
00:27:26,610 --> 00:27:27,856
one friend request.

544
00:27:27,856 --> 00:27:29,480
Maybe I can't send
you friend requests.

545
00:27:29,480 --> 00:27:30,910
I have multiple friend requests.

546
00:27:30,910 --> 00:27:32,020
I don't know.

547
00:27:32,020 --> 00:27:35,000
But you can have multiple
edges between vertices.

548
00:27:35,000 --> 00:27:37,460
And probably most
importantly is edges

549
00:27:37,460 --> 00:27:39,460
are often what we
call hyper, and this

550
00:27:39,460 --> 00:27:43,830
is a very important concept
that is almost never touched

551
00:27:43,830 --> 00:27:46,410
on in graph theory, which
is that an edge can connect

552
00:27:46,410 --> 00:27:49,250
multiple vertices
at the same time.

553
00:27:49,250 --> 00:27:51,806
A classic example
would be an email

554
00:27:51,806 --> 00:27:53,180
that you send to
multiple people.

555
00:27:53,180 --> 00:27:56,870
In fact, most emails these days
have more than one recipient.

556
00:27:56,870 --> 00:28:01,400
And so, me sending
an email to everyone

557
00:28:01,400 --> 00:28:04,260
in the class, one email
to everyone in the class,

558
00:28:04,260 --> 00:28:05,170
is a hyper edge.

559
00:28:05,170 --> 00:28:08,720
That edge, the email,
connects and has direction,

560
00:28:08,720 --> 00:28:11,530
you know, me with
all the recipients.

561
00:28:11,530 --> 00:28:13,960
That's very different
than me individually

562
00:28:13,960 --> 00:28:16,700
sending an email to every
single one of you, which

563
00:28:16,700 --> 00:28:18,740
would be a standard edge.
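One way to make that precise is an incidence matrix, where each hyper edge gets a row and each vertex a column. This is a toy Python sketch, not D4M; the names and the +1/-1 sign convention are made up for illustration:

```python
# Each email is one hyper edge (one row); each person is a column.
# Sign encodes direction: +1 marks the sender, -1 marks a recipient.
emails = {
    "email1": {"instructor": +1, "alice": -1, "bob": -1, "carol": -1},
    "email2": {"alice": +1, "instructor": -1},
}

def recipients(incidence, edge):
    """All vertices the given hyper edge points at."""
    return sorted(p for p, v in incidence[edge].items() if v < 0)

# One edge ("email1") connects the sender to three vertices at once.
print(recipients(emails, "email1"))  # → ['alice', 'bob', 'carol']
```

Sending each person a separate email would instead be one standard edge (one recipient) per row.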

564
00:28:18,740 --> 00:28:22,590
So this is an important concept
not typically discussed a lot

565
00:28:22,590 --> 00:28:26,680
in standard graph theory
that is very important.

566
00:28:26,680 --> 00:28:29,460
So we're going to move beyond
the sort of standard definition

567
00:28:29,460 --> 00:28:33,225
of graphs in this class,
and that's going to be hard.

568
00:28:33,225 --> 00:28:34,600
We're going to
deal with a bigger

569
00:28:34,600 --> 00:28:37,110
definition of linear algebra.

570
00:28:37,110 --> 00:28:40,430
Traditionally, linear algebra
is dealing with matrices.

571
00:28:40,430 --> 00:28:44,390
They have indices for
the rows and the columns

572
00:28:44,390 --> 00:28:47,130
and the values are real numbers.

573
00:28:47,130 --> 00:28:49,310
Maybe integers, maybe
complex numbers,

574
00:28:49,310 --> 00:28:51,790
we deal with complex
numbers here a lot,

575
00:28:51,790 --> 00:28:54,706
but we're going to be dealing
with things like strings.

576
00:28:54,706 --> 00:28:56,330
We're going to be
taking linear algebra

577
00:28:56,330 --> 00:28:58,900
and applying it to
things like words.

578
00:28:58,900 --> 00:29:00,867
And so that's going
to be different.

579
00:29:00,867 --> 00:29:02,450
In fact actually,
one of my favorite--

580
00:29:02,450 --> 00:29:03,670
it might be the part
of the course that

581
00:29:03,670 --> 00:29:05,253
puts you all asleep,
but it's actually

582
00:29:05,253 --> 00:29:07,860
one of my favorite
parts of the course.

583
00:29:07,860 --> 00:29:11,530
And then, bigger
definition of processing.

584
00:29:11,530 --> 00:29:13,200
One of the dominant
things you'll

585
00:29:13,200 --> 00:29:15,340
hear about in this
space-- there's

586
00:29:15,340 --> 00:29:17,290
a lot of technologies out there.

587
00:29:17,290 --> 00:29:19,600
A popular technology
is called Hadoop,

588
00:29:19,600 --> 00:29:23,210
and its map/reduce parallel
programming paradigm, which

589
00:29:23,210 --> 00:29:29,960
is a great technology to
kind of get people started,

590
00:29:29,960 --> 00:29:33,100
but for people with the
mathematical sophistication

591
00:29:33,100 --> 00:29:35,810
that we have, we can
use programming models

592
00:29:35,810 --> 00:29:37,950
that are simply better.

593
00:29:37,950 --> 00:29:41,080
If you understand mathematics
to the level that we do,

594
00:29:41,080 --> 00:29:44,490
we can give you tools for
programming parallel computers

595
00:29:44,490 --> 00:29:47,890
that are just more efficient
in terms of you write less code

596
00:29:47,890 --> 00:29:51,054
and you get better performance
than using techniques that

597
00:29:51,054 --> 00:29:52,970
really are designed for
people that don't have

598
00:29:52,970 --> 00:29:55,150
the mathematical, the
linear algebraic background

599
00:29:55,150 --> 00:29:57,200
that you have.

600
00:29:57,200 --> 00:30:02,030
So this is really going to be
the foundation of the course.

601
00:30:02,030 --> 00:30:05,230
So let me continue here
with the course outline.

602
00:30:05,230 --> 00:30:06,899
So this is going
to be, essentially,

603
00:30:06,899 --> 00:30:08,190
the nine lectures of the course.

604
00:30:08,190 --> 00:30:10,500
I think this is still
pretty accurate.

605
00:30:10,500 --> 00:30:12,060
So we're in the
introductory course,

606
00:30:12,060 --> 00:30:14,240
we're going to review
the course and goals.

607
00:30:14,240 --> 00:30:16,400
The next course is really
going to be dealing

608
00:30:16,400 --> 00:30:18,440
with this concept of
associative arrays, which

609
00:30:18,440 --> 00:30:22,480
is the core technology that sort
of brings all this together.

610
00:30:22,480 --> 00:30:25,270
When I talk about extending
linear algebra to words,

611
00:30:25,270 --> 00:30:28,770
by using fuzzy
algebra-- that really

612
00:30:28,770 --> 00:30:31,840
gets into mathematical
abstract algebraic concepts

613
00:30:31,840 --> 00:30:33,730
called group theory.

614
00:30:33,730 --> 00:30:37,380
It's not nearly as
scary as it may sound.

615
00:30:37,380 --> 00:30:38,220
We'll get into that.

616
00:30:38,220 --> 00:30:39,761
It's actually, I
think, quite elegant

617
00:30:39,761 --> 00:30:41,470
and really good to know.

618
00:30:41,470 --> 00:30:44,740
Good to know that when you
work with the technology

619
00:30:44,740 --> 00:30:46,575
and you work with
associative arrays,

620
00:30:46,575 --> 00:30:49,110
that we've actually thought
about the mathematics of how

621
00:30:49,110 --> 00:30:50,880
these things work together.

622
00:30:50,880 --> 00:30:52,970
And when we add a
feature, it's usually

623
00:30:52,970 --> 00:30:55,621
because it fits
mathematically, and when

624
00:30:55,621 --> 00:30:57,120
we don't add a
feature, it's usually

625
00:30:57,120 --> 00:30:58,170
because someone
has made a request

626
00:30:58,170 --> 00:31:00,253
and like, yeah, that's
really going to take people

627
00:31:00,253 --> 00:31:02,290
into a bad place.

628
00:31:02,290 --> 00:31:04,960
Then we're going to get into
really sort of the center

629
00:31:04,960 --> 00:31:06,540
part of the class
where we're going

630
00:31:06,540 --> 00:31:09,170
to show you how we do
analysis of entities

631
00:31:09,170 --> 00:31:11,745
on unstructured data,
and then doing analysis

632
00:31:11,745 --> 00:31:13,445
on unstructured data.

633
00:31:13,445 --> 00:31:15,320
We're going to--I've
talked about power laws.

634
00:31:15,320 --> 00:31:18,600
We're going to do a whole class
on power law data, modeling

635
00:31:18,600 --> 00:31:20,685
of that data, but then
we have another class

636
00:31:20,685 --> 00:31:22,707
on cross-correlation,
and then we're

637
00:31:22,707 --> 00:31:24,540
really going to get
into parallel processing

638
00:31:24,540 --> 00:31:25,270
and databases.

639
00:31:25,270 --> 00:31:27,110
And the last two
classes aren't ever

640
00:31:27,110 --> 00:31:28,770
going to really be lectures.

641
00:31:28,770 --> 00:31:32,094
They're going to be really
just the demos, and really--

642
00:31:32,094 --> 00:31:34,760
because that's where we're going
to really bring a lot of things

643
00:31:34,760 --> 00:31:37,110
together, and we're going to
walk you through code examples,

644
00:31:37,110 --> 00:31:39,430
but we're going to need to take
some time to really walk you

645
00:31:39,430 --> 00:31:41,180
through them, because
you're going to see,

646
00:31:41,180 --> 00:31:42,840
wow, there's all
these ideas that we've

647
00:31:42,840 --> 00:31:45,300
talked about coming together
in these code examples.

648
00:31:45,300 --> 00:31:48,160
So there it is, we have lectures.

649
00:31:48,160 --> 00:31:51,520
If you see in the software
that's available in your LLGrid

650
00:31:51,520 --> 00:31:54,430
accounts, all the lectures
are there already,

651
00:31:54,430 --> 00:31:57,740
and I think we have
lectures one through seven.

652
00:31:57,740 --> 00:31:59,940
So seven is the eighth lecture.

653
00:31:59,940 --> 00:32:02,010
Or, I'm sorry, zero
through seven, so seven

654
00:32:02,010 --> 00:32:05,602
is the eighth
lecture, and that will

655
00:32:05,602 --> 00:32:06,810
be a kind of a short lecture.

656
00:32:06,810 --> 00:32:10,940
And then the ninth lecture will
be just going through examples.

657
00:32:13,550 --> 00:32:17,770
So we believe in
this whole philosophy

658
00:32:17,770 --> 00:32:23,960
of taking a linear
algebraic point of view

659
00:32:23,960 --> 00:32:27,050
with graphs so much that we
even wrote a book about it.

660
00:32:27,050 --> 00:32:31,120
So you have all been
given copies of this book.

661
00:32:31,120 --> 00:32:36,570
This really gives you the
philosophy and the power

662
00:32:36,570 --> 00:32:39,235
of thinking about
graphs using these techniques,

663
00:32:39,235 --> 00:32:42,430
so I'd highly encourage
you to flip through it.

664
00:32:42,430 --> 00:32:44,750
There are certain parts of
this, certain sections here

665
00:32:44,750 --> 00:32:47,710
that are really almost taken
directly out of this book,

666
00:32:47,710 --> 00:32:50,680
but a lot of it is
supplementary material.

667
00:32:50,680 --> 00:32:53,820
And what you'll discover is
that this course is really

668
00:32:53,820 --> 00:32:56,560
the bridge to these techniques.

669
00:32:56,560 --> 00:32:58,890
This course allows
you to go from kind

670
00:32:58,890 --> 00:33:02,200
of this mess of
unstructured words

671
00:33:02,200 --> 00:33:05,640
and other types of
things, and then

672
00:33:05,640 --> 00:33:07,985
put them into this nice
linear algebraic format,

673
00:33:07,985 --> 00:33:11,760
so that now you can
do these techniques that

674
00:33:11,760 --> 00:33:13,420
are described in the book.

675
00:33:13,420 --> 00:33:16,200
So this is almost a bridge
to that, and I'd certainly

676
00:33:16,200 --> 00:33:19,050
encourage you to look
at that and to ask me

677
00:33:19,050 --> 00:33:20,510
questions about it at any time.

678
00:33:23,330 --> 00:33:27,860
So now I'm going to come back
to our original example, which

679
00:33:27,860 --> 00:33:29,694
was the cyber thing
and kind of walk us

680
00:33:29,694 --> 00:33:31,360
through that in a
little bit more detail

681
00:33:31,360 --> 00:33:32,985
and maybe highlight
a few of the things

682
00:33:32,985 --> 00:33:34,654
that I've talked about already.

683
00:33:37,260 --> 00:33:43,499
So here's a very standard
data processing pipeline.

684
00:33:43,499 --> 00:33:45,040
So what we want to do is--
we have raw data,

685
00:33:45,040 --> 00:33:48,760
in this case our raw cyber data,
which is a series of records.

686
00:33:48,760 --> 00:33:52,070
And we need to convert that
into a set of vertices and edge

687
00:33:52,070 --> 00:33:58,620
lists from which we can then
do graphs and graph analysis.

688
00:33:58,620 --> 00:34:03,050
And, as you'll see,
typically what you do

689
00:34:03,050 --> 00:34:08,090
is you write parsers to convert
this data from some raw format

690
00:34:08,090 --> 00:34:10,659
to some format that is
sort of the next step

691
00:34:10,659 --> 00:34:12,370
in your processing chain.

692
00:34:12,370 --> 00:34:14,800
And then we convert
these edge lists

693
00:34:14,800 --> 00:34:23,040
into adjacency matrices, which
is how we view our graphs.

694
00:34:23,040 --> 00:34:25,530
One word on adjacency
matrix, let me go back here.

695
00:34:28,719 --> 00:34:32,219
This is a graph, [LAUGHS] for
those of you who don't know.

696
00:34:32,219 --> 00:34:35,800
The set of vertices
and edges, OK?

697
00:34:35,800 --> 00:34:38,590
And this is an adjacency
matrix of that graph.

698
00:34:38,590 --> 00:34:41,590
Basically every single
row is a vertex,

699
00:34:41,590 --> 00:34:43,650
every single column is
a vertex, and if an edge

700
00:34:43,650 --> 00:34:47,159
exists we put a dot here.

701
00:34:47,159 --> 00:34:49,489
There's a formal duality
between graph theory,

702
00:34:49,489 --> 00:34:51,360
the fundamental
operation of graph theory

703
00:34:51,360 --> 00:34:55,030
is what's called breadth-first
search, which is shown here,

704
00:34:55,030 --> 00:34:57,440
given a starting vertex.

705
00:34:57,440 --> 00:35:00,150
You know, traverse its
edges to its neighbors,

706
00:35:00,150 --> 00:35:03,870
and we have the same
thing going on here.

707
00:35:03,870 --> 00:35:06,180
Given a starting
vertex in a vector,

708
00:35:06,180 --> 00:35:10,010
we do a matrix multiply
to identify the neighbors.

709
00:35:10,010 --> 00:35:14,781
And so at the deepest, most
fundamental level, graph theory

710
00:35:14,781 --> 00:35:16,030
and linear algebra are linked.

711
00:35:16,030 --> 00:35:18,071
Because the fundamental
operation of graph theory

712
00:35:18,071 --> 00:35:21,630
is breadth-first search and
the fundamental operation

713
00:35:21,630 --> 00:35:25,267
of linear algebra is
vector matrix multiply.
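That duality can be shown in a few lines of Python (a toy dict-of-dicts standing in for the real sparse matrices, with an invented four-vertex graph): one breadth-first step is exactly one vector-matrix multiply against the adjacency matrix:

```python
# Adjacency matrix as {source: {destination: weight}}.
A = {
    "a": {"b": 1, "c": 1},
    "b": {"d": 1},
    "c": {"d": 1},
    "d": {},
}

def step(v, A):
    """One breadth-first step, written as the multiply v * A."""
    out = {}
    for src, w in v.items():
        for dst, e in A[src].items():
            out[dst] = out.get(dst, 0) + w * e
    return out

v = step({"a": 1}, A)   # start at "a": its neighbors
print(sorted(v))        # → ['b', 'c']
v = step(v, A)          # two hops out
print(v)                # → {'d': 2}  (the 2 counts both paths to "d")
```

A bonus of the algebraic view: the values that come out of the multiply count the number of paths, which plain breadth-first search discards.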

714
00:35:28,680 --> 00:35:31,390
So we wish to form these
adjacency matrices so we

715
00:35:31,390 --> 00:35:34,177
can take advantage of that.

716
00:35:34,177 --> 00:35:35,760
Here's a little bit
more detail on how

717
00:35:35,760 --> 00:35:39,005
we're going to use this
technology D4M to do that.

718
00:35:39,005 --> 00:35:41,890
So we have our raw data,
we convert the raw data

719
00:35:41,890 --> 00:35:45,200
into a CSV file-- CSV stands
for comma separated values.

720
00:35:45,200 --> 00:35:48,814
It's kind of the default format
of spreadsheets and tables.

721
00:35:48,814 --> 00:35:50,480
It's become very
popular over the years.

722
00:35:50,480 --> 00:35:52,688
I remember when it first
came out 20 or 30 years ago,

723
00:35:52,688 --> 00:35:53,686
it wasn't that popular.

724
00:35:53,686 --> 00:35:55,560
You know, it's like
people did-- but now it's

725
00:35:55,560 --> 00:35:58,805
become an extraordinarily
popular file format.

726
00:35:58,805 --> 00:36:02,090
It really is ideally suited
for this kind of data.

727
00:36:02,090 --> 00:36:05,930
So we convert that
into CSV files.

728
00:36:05,930 --> 00:36:08,380
We then take those
CSV files, read them

729
00:36:08,380 --> 00:36:11,710
in using our D4M
technology to insert them

730
00:36:11,710 --> 00:36:14,290
into a distributed database.

731
00:36:14,290 --> 00:36:16,510
We can then query that
distributed database to get

732
00:36:16,510 --> 00:36:18,760
these associative arrays.

733
00:36:18,760 --> 00:36:20,670
And then we can, from
the associative arrays,

734
00:36:20,670 --> 00:36:24,320
now we have the full power
to do our graph algorithms.

735
00:36:24,320 --> 00:36:28,050
So D4M really helps
you in going, you know,

736
00:36:28,050 --> 00:36:30,880
these few steps and setting the
table so that you can then do your

737
00:36:30,880 --> 00:36:31,735
graph algorithms.

738
00:36:31,735 --> 00:36:33,610
And you can even then
do the graph algorithms

739
00:36:33,610 --> 00:36:36,090
without even using D4M at all.

740
00:36:36,090 --> 00:36:39,410
You can just use regular linear
algebraic operations in MATLAB

741
00:36:39,410 --> 00:36:40,410
just fine.

742
00:36:40,410 --> 00:36:42,100
But this is that
bridge, that connector.

743
00:36:42,100 --> 00:36:44,590
In fact, I highly
recommend people do that.

744
00:36:44,590 --> 00:36:47,210
I highly recommend you
use D4M for only the parts

745
00:36:47,210 --> 00:36:49,750
of your problem
that it's good for,

746
00:36:49,750 --> 00:36:52,310
and use other
technologies, you know,

747
00:36:52,310 --> 00:36:56,340
like MATLAB, ordinary matrices,
and other types of things.

748
00:36:56,340 --> 00:36:58,640
That's good for you, you'll
get better performance.

749
00:36:58,640 --> 00:37:02,260
It's good for me because,
if you have problems,

750
00:37:02,260 --> 00:37:04,504
you'll bother somebody
else, and not us.

751
00:37:04,504 --> 00:37:06,170
So that's good, it's
good for everybody.

752
00:37:11,110 --> 00:37:15,570
So an example of kind
of what this looks like.

753
00:37:15,570 --> 00:37:19,310
Here's a proxy log that
we got from the data.

754
00:37:19,310 --> 00:37:24,900
It shows a web, essentially a
web log, the first thing we do

755
00:37:24,900 --> 00:37:29,700
is essentially convert
this into a CSV file.

756
00:37:29,700 --> 00:37:33,500
So basically, each entry
here gets a column,

757
00:37:33,500 --> 00:37:36,810
we have an ID for
the actual entry.

758
00:37:36,810 --> 00:37:40,410
In this case it's a source
IP, a server IP, a time stamp.

759
00:37:40,410 --> 00:37:45,076
And then, the actual line of
text that's associated with it.

760
00:37:45,076 --> 00:37:46,450
And this can go
on and on and on.

761
00:37:46,450 --> 00:37:48,680
We can have many
different types here.

762
00:37:48,680 --> 00:37:56,050
And if you are a
standard database person,

763
00:37:56,050 --> 00:37:59,420
put your SQL hat, which is sort
of the standard database, on.

764
00:37:59,420 --> 00:38:01,060
You would just take
that data, and you

765
00:38:01,060 --> 00:38:04,090
would make a table that
would have these columns,

766
00:38:04,090 --> 00:38:06,510
and you would insert them
and away you would go.

767
00:38:06,510 --> 00:38:07,640
OK?

768
00:38:07,640 --> 00:38:12,990
But what we discover is that--
the challenge becomes when you

769
00:38:12,990 --> 00:38:15,796
want to look up data quickly.

770
00:38:15,796 --> 00:38:16,920
Or you want to insert data.

771
00:38:16,920 --> 00:38:18,794
Let's say you have an
enormous volume of data

772
00:38:18,794 --> 00:38:20,850
that you want to
insert and you want

773
00:38:20,850 --> 00:38:24,610
to go to look up any, say, IP
address, or any time stamp,

774
00:38:24,610 --> 00:38:26,230
or something like that.

775
00:38:26,230 --> 00:38:30,270
What you discover is
that SQL might give you

776
00:38:30,270 --> 00:38:32,722
good performance, but there
are databases out there

777
00:38:32,722 --> 00:38:34,930
and technology that will
give you better performance.

778
00:38:38,290 --> 00:38:42,400
And these databases are
called triple stores,

779
00:38:42,400 --> 00:38:48,330
because they store all
the data as-- anyone?

780
00:38:48,330 --> 00:38:49,660
Triples, yes.

781
00:38:49,660 --> 00:38:52,550
They store all the
data as triples,

782
00:38:52,550 --> 00:38:55,980
which means we will have
a row, and a column,

783
00:38:55,980 --> 00:38:57,860
and a value associated
with everything.

784
00:38:57,860 --> 00:39:00,930
So what we do is we
take our dense data,

785
00:39:00,930 --> 00:39:04,680
and we do what we call,
create an exploded table.

786
00:39:04,680 --> 00:39:08,690
So essentially we have the
log ID, let me append that,

787
00:39:08,690 --> 00:39:15,025
and we take the column name
and we append its value to it,

788
00:39:15,025 --> 00:39:16,650
and server IP.

789
00:39:16,650 --> 00:39:19,210
And this is what you would
call an exploded table.
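Here is what that explosion looks like in a few lines of Python (plain dictionaries standing in for D4M and the database; the log IDs and addresses are invented): every non-empty cell of the dense table becomes one (row, column|value, 1) triple, so blanks are simply never stored:

```python
# Dense records keyed by log ID (values invented for illustration).
logs = {
    "log001": {"src_ip": "128.0.0.1", "server_ip": "208.29.69.138"},
    "log002": {"src_ip": "128.0.0.2", "server_ip": "208.29.69.138"},
}

# Exploded table: append each value to its column name, store a 1.
triples = [(row, f"{col}|{val}", 1)
           for row, rec in logs.items()
           for col, val in rec.items()]

for t in sorted(triples):
    print(t)
# ('log001', 'server_ip|208.29.69.138', 1)
# ('log001', 'src_ip|128.0.0.1', 1)
# ('log002', 'server_ip|208.29.69.138', 1)
# ('log002', 'src_ip|128.0.0.2', 1)
```

Every distinct value mints its own column, which is exactly the growth a fixed SQL schema resists and a triple store absorbs.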

790
00:39:19,210 --> 00:39:21,890
Now, if you're an old
SQL person like me,

791
00:39:21,890 --> 00:39:24,180
this schema will
immediately give you hives,

792
00:39:24,180 --> 00:39:25,770
and other types of
allergic reactions,

793
00:39:25,770 --> 00:39:27,570
and you will want to
go to the hospital,

794
00:39:27,570 --> 00:39:31,750
because you're creating an
enormous number of columns.

795
00:39:31,750 --> 00:39:35,420
And in SQL you can
dynamically create columns,

796
00:39:35,420 --> 00:39:37,550
but at your peril.

797
00:39:37,550 --> 00:39:40,390
Basically, you can create
columns dynamically,

798
00:39:40,390 --> 00:39:42,060
but the database
reserves the right

799
00:39:42,060 --> 00:39:45,510
to recopy the entire table,
and reformat the entire table,

800
00:39:45,510 --> 00:39:47,392
if it so requires.

801
00:39:47,392 --> 00:39:49,350
Well triple stores don't
have this requirement.

802
00:39:49,350 --> 00:39:52,650
Triple stores only store
the non-blank entries.

803
00:39:52,650 --> 00:39:55,780
And so they're perfectly happy
with as many columns as you want.

804
00:39:55,780 --> 00:39:59,700
If you have a billion rows,
10 billion columns, completely

805
00:39:59,700 --> 00:40:01,760
fine with that.

806
00:40:01,760 --> 00:40:02,988
So that's a powerful thing.

807
00:40:05,740 --> 00:40:07,360
So I think we talked about that.

808
00:40:07,360 --> 00:40:10,170
So we use the indices
as rows, and we

809
00:40:10,170 --> 00:40:15,665
create unique type-value
pairs for every single column.

810
00:40:19,820 --> 00:40:28,770
Now, by itself, this schema
does you absolutely no good.

811
00:40:28,770 --> 00:40:32,580
You basically just
made your life harder,

812
00:40:32,580 --> 00:40:37,930
because, by itself,
most database tables,

813
00:40:37,930 --> 00:40:41,576
database systems,
including triple stores,

814
00:40:41,576 --> 00:40:42,950
are going to have
an orientation.

815
00:40:42,950 --> 00:40:45,158
They're going to be row
oriented, or column oriented.

816
00:40:45,158 --> 00:40:47,580
And in this case, the databases
we will be talking about

817
00:40:47,580 --> 00:40:49,530
are almost always row oriented.

818
00:40:49,530 --> 00:40:52,990
Which means they can get
a row in constant time,

819
00:40:52,990 --> 00:40:56,870
give it a row key like
this and you can look it up

820
00:40:56,870 --> 00:40:57,550
very quickly.

821
00:40:57,550 --> 00:40:59,450
And it will give it back
to you very quickly.

822
00:40:59,450 --> 00:41:00,830
That's what it's designed to do.

823
00:41:00,830 --> 00:41:03,550
And you can insert
enormous volumes of data,

824
00:41:03,550 --> 00:41:06,650
and they will preserve
that contract.

825
00:41:06,650 --> 00:41:09,010
You can be inserting millions
of entries per second,

826
00:41:09,010 --> 00:41:11,530
and you can still look
up stuff in milliseconds.

827
00:41:11,530 --> 00:41:15,100
So very powerful technology,
but all we've done so far

828
00:41:15,100 --> 00:41:19,010
is make it so that
the format of my data

829
00:41:19,010 --> 00:41:21,450
is in this funny format,
coming out the other side.

830
00:41:21,450 --> 00:41:24,240
I've just made my life harder.

831
00:41:24,240 --> 00:41:34,380
Well, there is a payoff,
which is we also store

832
00:41:34,380 --> 00:41:36,670
the transpose of the table.

833
00:41:36,670 --> 00:41:39,540
So, to anybody who is used
to sparse linear algebra,

834
00:41:39,540 --> 00:41:41,990
this is an
ancient technique.

835
00:41:41,990 --> 00:41:44,320
We almost always store
A and A transpose,

836
00:41:44,320 --> 00:41:46,230
because in sparse
linear algebra we'll

837
00:41:46,230 --> 00:41:50,260
have either a row-oriented
or a column-oriented store.

838
00:41:50,260 --> 00:41:52,370
And then we just want
to keep one around

839
00:41:52,370 --> 00:41:54,731
because certain operations
are going to be faster

840
00:41:54,731 --> 00:41:56,480
with the transpose,
and certain operations

841
00:41:56,480 --> 00:41:58,380
are going to be faster
with the matrix.

842
00:41:58,380 --> 00:42:00,900
We do the exact same
thing here, and combined

843
00:42:00,900 --> 00:42:03,410
with our exploded schema,
now every single one

844
00:42:03,410 --> 00:42:05,600
of those columns is a row.

845
00:42:05,600 --> 00:42:08,600
And now we have indexed
every single string

846
00:42:08,600 --> 00:42:10,270
in the entire database.

847
00:42:10,270 --> 00:42:14,190
Once, with one simple
schema of two tables.

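The pay-off of keeping both a table and its transpose can be sketched with scipy.sparse as a stand-in (an analogy, not the actual database implementation): a row-oriented (CSR) matrix answers row queries quickly, and storing its transpose in the same row-oriented form makes column queries equally fast, at the cost of doubling storage.

```python
import numpy as np
from scipy import sparse

# Sparse table: rows are records, columns are exploded type|value keys.
A = sparse.random(1000, 500, density=0.01, format="csr", random_state=0)
At = A.T.tocsr()  # the transpose, also stored row-oriented

row = A[42]    # fast: CSR slices out a whole row directly
col = At[17]   # fast: row 17 of the transpose is column 17 of A
```

Both routes return the same data; the transpose just makes the column direction as cheap as the row direction.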
848
00:42:14,190 --> 00:42:18,600
That's not to say that all
your databases will only

849
00:42:18,600 --> 00:42:24,040
have two tables-- three
tables or four tables

850
00:42:24,040 --> 00:42:26,300
is typically
what we see,

851
00:42:26,300 --> 00:42:29,960
which is a huge
reduction compared

852
00:42:29,960 --> 00:42:31,950
to the standard database
schemas that one

853
00:42:31,950 --> 00:42:35,030
uses where it's easy to have
dozens and dozens of tables,

854
00:42:35,030 --> 00:42:38,550
even 100 tables in a
well thought out schema.

855
00:42:38,550 --> 00:42:41,920
Here, essentially it's one
schema to rule them all.

856
00:42:41,920 --> 00:42:43,420
One table and its
pair, and you've

857
00:42:43,420 --> 00:42:45,070
indexed every single thing.

858
00:42:45,070 --> 00:42:47,302
And the cost is, you've
doubled your storage.

859
00:42:47,302 --> 00:42:49,010
And every single
database person that you

860
00:42:49,010 --> 00:42:52,650
talk to-- if I said in exchange
for doubling your storage I

861
00:42:52,650 --> 00:42:55,950
would give you a fast index
access to every single string

862
00:42:55,950 --> 00:42:57,540
in your database,
they'll usually

863
00:42:57,540 --> 00:43:02,280
take that deal 100
times out of 100.

864
00:43:02,280 --> 00:43:05,000
So that's the power
here of this technology,

865
00:43:05,000 --> 00:43:07,789
is you can reference
enormous data sets.

866
00:43:07,789 --> 00:43:10,080
And we'll come back to this
schema over and over again.

867
00:43:10,080 --> 00:43:11,530
But that's sort of the method to the madness.

868
00:43:11,530 --> 00:43:15,030
We'll be-- as we move forward
with associative arrays we always

869
00:43:15,030 --> 00:43:16,470
view things in this way.

870
00:43:16,470 --> 00:43:17,947
And I want you to
know, at the end,

871
00:43:17,947 --> 00:43:19,780
there's a really good
reason why we do that.

872
00:43:22,760 --> 00:43:28,380
So now we've ingested our
data into the database

873
00:43:28,380 --> 00:43:29,370
and we can do queries.

874
00:43:29,370 --> 00:43:32,270
So if I have a binding
to a table-- and one

875
00:43:32,270 --> 00:43:33,980
of the things in the
D4M technology

876
00:43:33,980 --> 00:43:35,480
that we do that's
very nice for you,

877
00:43:35,480 --> 00:43:38,210
is if you're using a
table and its transpose,

878
00:43:38,210 --> 00:43:40,060
we hide that for you completely.

879
00:43:40,060 --> 00:43:41,790
If you do an insert
into the table,

880
00:43:41,790 --> 00:43:44,860
it will automatically
insert its transpose.

881
00:43:44,860 --> 00:43:47,516
And if you do a look-up on rows,
it will know to use one table,

882
00:43:47,516 --> 00:43:48,890
and if you do a
look-up on columns,

883
00:43:48,890 --> 00:43:50,870
it will know to use
the other table.

884
00:43:50,870 --> 00:43:53,370
So here we're doing
essentially a range query.

885
00:43:53,370 --> 00:43:58,190
This just says give me all
the data from this range.

886
00:43:58,190 --> 00:44:01,500
Time stamp May
whatever to-- it looks

887
00:44:01,500 --> 00:44:03,430
like about a 3 hour time window.

888
00:44:03,430 --> 00:44:05,140
So that's a query
into the database,

889
00:44:05,140 --> 00:44:08,770
and it returns an
associative array of keys.

890
00:44:08,770 --> 00:44:10,350
And here is what
that looks like.

891
00:44:10,350 --> 00:44:11,724
It looks like a
bunch of triples.

892
00:44:14,930 --> 00:44:17,370
I'm going to then take
the rows of those keys.

893
00:44:17,370 --> 00:44:22,580
So basically I found
every single row,

894
00:44:22,580 --> 00:44:24,160
within a particular
time stamp, and I

895
00:44:24,160 --> 00:44:26,170
want to get the whole row.

896
00:44:26,170 --> 00:44:31,200
So I can get the row of that and
pass that back into the table

897
00:44:31,200 --> 00:44:33,190
to get the actual data.

898
00:44:33,190 --> 00:44:34,790
Now you see I get the whole row.

899
00:44:34,790 --> 00:44:37,856
So I get server IP, and all
that stuff, the complete list.

900
00:44:42,050 --> 00:44:44,540
Now I'm going to do a
little algebra here.

901
00:44:44,540 --> 00:44:46,780
I want to create the graph

902
00:44:46,780 --> 00:44:52,480
of all source
IPs and server IPs.

903
00:44:52,480 --> 00:44:56,440
I get to do this with,
basically, a correlation here,

904
00:44:56,440 --> 00:44:58,610
which is essentially a
matrix-matrix multiply.

905
00:44:58,610 --> 00:45:04,030
That's it, I've just constructed
the entire source IP and server

906
00:45:04,030 --> 00:45:07,170
IP graph from all the
data in this time window.

907
00:45:07,170 --> 00:45:09,080
That is the power of D4M.

908
00:45:09,080 --> 00:45:13,040
It allows you to do correlations
on this kind of data,

909
00:45:13,040 --> 00:45:16,160
same way we do correlations
using traditional techniques

910
00:45:16,160 --> 00:45:18,020
in linear algebra.

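The correlation itself can be sketched in ordinary linear algebra (a toy Python example with invented data, not D4M syntax): split the exploded table into its source-IP columns and server-IP columns, and a single matrix-matrix multiply produces the source-by-server graph.

```python
import numpy as np

# Toy exploded table: 3 log rows x 4 columns.
# Columns (hypothetical keys): srcIP|a, srcIP|b, serverIP|x, serverIP|y
E = np.array([[1, 0, 1, 0],
              [1, 0, 0, 1],
              [0, 1, 1, 0]])
A_src, A_srv = E[:, :2], E[:, 2:]

# G[i, j] counts the log rows where source i and server j co-occur:
# the adjacency matrix of the source IP / server IP graph.
G = A_src.T @ A_srv
```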
911
00:45:18,020 --> 00:45:21,320
And you can be like, "Wow."

912
00:45:21,320 --> 00:45:22,800
This is motivational
at this point.

913
00:45:22,800 --> 00:45:25,390
We'll get into exactly
how you do this,

914
00:45:25,390 --> 00:45:28,940
and exactly how this
works in the example.

915
00:45:28,940 --> 00:45:30,980
But this is really
what people use

916
00:45:30,980 --> 00:45:34,890
D4M for, is to get to
this point where they can do

917
00:45:34,890 --> 00:45:36,660
these types of manipulations.

918
00:45:36,660 --> 00:45:39,650
And the key to detection
theory is doing correlations.

919
00:45:39,650 --> 00:45:41,910
Correlations allow
us to determine

920
00:45:41,910 --> 00:45:43,630
what the backgrounds
of our data are,

921
00:45:43,630 --> 00:45:45,600
what the clutter
of our data are,

922
00:45:45,600 --> 00:45:49,200
and then we are on our way
with traditional detection

923
00:45:49,200 --> 00:45:51,710
theory, and signal processing
techniques, and all the things

924
00:45:51,710 --> 00:45:52,584
that we know and love.

925
00:45:52,584 --> 00:45:57,360
So this is kind of that
connection made manifest.

926
00:45:57,360 --> 00:46:00,650
And so here you can see, we
now have an associative array

927
00:46:00,650 --> 00:46:03,870
or an adjacency matrix that
essentially relates

928
00:46:03,870 --> 00:46:07,106
source IPs and their server IPs.

929
00:46:07,106 --> 00:46:08,870
You can actually
plot that and that's

930
00:46:08,870 --> 00:46:12,910
how we get the adjacency
matrix of this graph,

931
00:46:12,910 --> 00:46:15,480
G. This is what it is.

932
00:46:15,480 --> 00:46:17,960
That's how we constructed--
this is the code we actually

933
00:46:17,960 --> 00:46:19,484
used to construct that graph.

934
00:46:22,860 --> 00:46:23,840
So moving on here.

935
00:46:23,840 --> 00:46:26,750
So this just shows the kind
of the whole thing here.

936
00:46:26,750 --> 00:46:29,990
We had a whole week's
worth of proxy data.

937
00:46:29,990 --> 00:46:32,670
This just shows you the timing,
that it took about two hours

938
00:46:32,670 --> 00:46:38,330
to ingest it, and about three
hours to do this processing.

939
00:46:38,330 --> 00:46:43,500
100 million proxy logs,
44.5 billion triples.

940
00:46:43,500 --> 00:46:45,390
This is big data.

941
00:46:45,390 --> 00:46:47,630
You know, really
this is big data.

942
00:46:47,630 --> 00:46:50,860
This is the kind of
thing that you can do.

943
00:46:50,860 --> 00:46:53,030
Whatever anybody
else is doing, you

944
00:46:53,030 --> 00:46:56,372
should be able to do on larger
scales than other people

945
00:46:56,372 --> 00:46:57,288
using this technology.

946
00:46:59,920 --> 00:47:03,370
So that brings us to
the end of the lecture.

947
00:47:03,370 --> 00:47:07,112
So just to summarize,
big data of this type

948
00:47:07,112 --> 00:47:09,320
is found in a wide range of
areas: document analysis,

949
00:47:09,320 --> 00:47:12,520
computer networks,
DNA sequencing.

950
00:47:12,520 --> 00:47:14,300
There's a kind of a
gap between the tools

951
00:47:14,300 --> 00:47:16,680
that people traditionally
use on this problem,

952
00:47:16,680 --> 00:47:20,420
and D4M, this technology,
fills this gap.

953
00:47:20,420 --> 00:47:24,260
So, with that, are
there any questions

954
00:47:24,260 --> 00:47:27,730
before we get onto
the examples slide?

955
00:47:27,730 --> 00:47:28,258
Yes?

956
00:47:28,258 --> 00:47:29,174
AUDIENCE: [INAUDIBLE]?

957
00:47:34,799 --> 00:47:35,590
JEREMY KEPNER: Yes.

958
00:47:35,590 --> 00:47:36,506
AUDIENCE: [INAUDIBLE]?

959
00:47:39,740 --> 00:47:43,620
JEREMY KEPNER: So the data--

960
00:47:43,717 --> 00:47:45,300
AUDIENCE: Can you
repeat the question?

961
00:47:45,300 --> 00:47:46,758
JEREMY KEPNER: Oh
yes, the question

962
00:47:46,758 --> 00:47:50,070
was does it generalize to
multi-dimensions, I guess?

963
00:47:50,070 --> 00:47:54,935
And so, there may be kind
of two questions there.

964
00:47:57,870 --> 00:48:02,400
If you had-- if I just
correlated all the data,

965
00:48:02,400 --> 00:48:06,300
took that table out, and
just fully correlated it,

966
00:48:06,300 --> 00:48:09,110
I would end up
with a giant matrix

967
00:48:09,110 --> 00:48:12,890
of blocks, each one representing
different correlations.

968
00:48:12,890 --> 00:48:16,560
So the source IP, server IP
correlation, the source IP,

969
00:48:16,560 --> 00:48:18,380
source IP correlation,
which would be empty

970
00:48:18,380 --> 00:48:20,280
because it's a bipartite graph.

971
00:48:20,280 --> 00:48:21,790
And then all the
other correlations

972
00:48:21,790 --> 00:48:25,160
would form different blocks,
and that type of thing.

973
00:48:25,160 --> 00:48:28,284
So in that sense we
support multi-dimensions.

974
00:48:28,284 --> 00:48:30,450
But as for supporting
multi-dimensions

975
00:48:30,450 --> 00:48:33,180
in terms of tensors, OK?

976
00:48:33,180 --> 00:48:35,860
Well, generally in
sparse linear algebra,

977
00:48:35,860 --> 00:48:39,770
there's very minimal support
for higher dimensions.

978
00:48:39,770 --> 00:48:41,930
I'll put a little shout
out to my friend Tammy

979
00:48:41,930 --> 00:48:45,630
Kolda, who has a technology
called the Sparse Tensor

980
00:48:45,630 --> 00:48:46,270
Toolbox.

981
00:48:46,270 --> 00:48:51,730
She's a scientist at Sandia and
so I encourage you to check that out.

982
00:48:51,730 --> 00:48:54,050
But all of our stuff
is in two dimensions

983
00:48:54,050 --> 00:48:55,840
from a linear
algebra perspective.

984
00:48:55,840 --> 00:48:57,820
Also linear
algebraically, the math

985
00:48:57,820 --> 00:49:01,280
there is a little bit
more common than when

986
00:49:01,280 --> 00:49:02,710
you get into tensors.

987
00:49:02,710 --> 00:49:05,137
And then you have these
kinds of weird products

988
00:49:05,137 --> 00:49:07,470
that you're starting, which
you can define, but are just

989
00:49:07,470 --> 00:49:10,020
not commonly taught.

990
00:49:10,020 --> 00:49:12,382
And you might be
like, well you're really

991
00:49:12,382 --> 00:49:13,340
losing something there.

992
00:49:13,340 --> 00:49:15,200
I'd really like to have
multiple dimensions.

993
00:49:15,200 --> 00:49:18,780
Well, when we've looked at
it, the fact of the matter

994
00:49:18,780 --> 00:49:23,560
is the sparsity always increases
as you go to higher dimensions.

995
00:49:23,560 --> 00:49:25,520
And the data
is sparse enough.

996
00:49:25,520 --> 00:49:27,310
So typically the
kinds of data that we

997
00:49:27,310 --> 00:49:30,200
will have-- if you
have n vertices,

998
00:49:30,200 --> 00:49:33,840
you'll have maybe
10 times n edges.

999
00:49:33,840 --> 00:49:38,230
So the sparsity
is extremely low.

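To put rough numbers on that (the figures are illustrative, following the "n vertices, about 10n edges" rule of thumb above): the density of the adjacency structure drops by another factor of n for each dimension you add.

```python
def density(n_vertices, edges_per_vertex=10, dims=2):
    """Fraction of non-zero entries in a dims-way adjacency structure
    with n_vertices per mode and edges_per_vertex * n_vertices entries."""
    return edges_per_vertex * n_vertices / n_vertices ** dims

d2 = density(1_000_000)          # 2-D adjacency matrix: ~1e-5 filled
d3 = density(1_000_000, dims=3)  # 3-D tensor: ~1e-11, far sparser still
```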
1000
00:49:38,230 --> 00:49:39,900
If we added a dimension,

1001
00:49:39,900 --> 00:49:42,050
the sort of internal
data structure costs

1002
00:49:42,050 --> 00:49:44,820
associated with adding
that dimension would not

1003
00:49:44,820 --> 00:49:45,570
reap the benefits.

1004
00:49:45,570 --> 00:49:47,920
Because now your
sparsity is even lower.

1005
00:49:47,920 --> 00:49:50,630
And so that's
generally not something

1006
00:49:50,630 --> 00:49:52,034
that is commonly used.

1007
00:49:52,034 --> 00:49:53,700
But, if you're
interested in exploring that,

1008
00:49:53,700 --> 00:49:56,210
I certainly encourage people
to download that toolbox.

1009
00:49:56,210 --> 00:50:01,694
Very well written software,
and people can explore that.

1010
00:50:01,694 --> 00:50:02,360
Other questions?

1011
00:50:02,360 --> 00:50:03,980
That was an
outstanding question.

1012
00:50:03,980 --> 00:50:07,514
And other questions
at this time?

1013
00:50:07,514 --> 00:50:09,430
All right well why don't
we stop the recording

1014
00:50:09,430 --> 00:50:13,140
right now, and then we
will switch-- why don't we

1015
00:50:13,140 --> 00:50:14,980
take like a five minute break.

1016
00:50:14,980 --> 00:50:17,540
I think there's some water out
there and stuff for people.

1017
00:50:17,540 --> 00:50:21,040
And then we'll get back to
the examples, which probably

1018
00:50:21,040 --> 00:50:22,280
will be a lot more fun.

1019
00:50:22,280 --> 00:50:27,760
So hopefully that will wake you up
after these new graphs.