Recently I've been tasked with writing down some thoughts as discussion ideas and teasers on current digital forensics, IT security and data science. Some of these ideas have been floating around for a long time, more as reactions to events than as a real effort at a serious discussion.
At first glance digital forensics and data science do not have much in common, especially when we look at how digital forensics is approached and executed today. What is usually not taken into account is the fact that digital forensics belongs to both computer science and forensic science, two very different fields. Digital forensics is still a new field being incorporated into forensics, and its digital specifics should be recognized and accommodated within the traditional forensic environment.
For a start, some definitions should be stated, beginning with forensics and digital forensics. Forensics is "the application of scientific knowledge to legal problems" (Merriam-Webster), which covers forensic medicine, physics, chemistry, dentistry, fingerprints, DNA, firearm analysis and accounting, all traditional sciences. For digital forensics, on the other hand, we have the early idea of "forensic computing" formulated by W. Venema and D. Farmer in the late 1990s: "Gathering and analyzing data in a manner as free from distortion or bias as possible to reconstruct data or what has happened in the past on a system." When this definition of forensic computing is extended with digital evidence, we get digital forensics in the current sense. Wikipedia defines computer forensics as: "Computer forensics, sometimes known as computer forensic science, is a branch of digital forensic science pertaining to evidence found in computers and digital storage media. The goal of computer forensics is to examine digital media in a forensically sound manner with the aim of identifying, preserving, recovering, analyzing and presenting facts and opinions about the digital information." In this context digital evidence, or electronic evidence, is defined as "any probative information stored or transmitted in digital form that a party to a court case may use at trial."
To make things more difficult, digital evidence is the key element of digital forensics, which makes it hard to accept in traditional forensics and law, where sound physical evidence is the gold standard. Traditional forensic science also does not deal with large amounts of data; it deals with specific scenarios and analyses that result in limited datasets, which leads to a different sensitivity to, and understanding of, data and computer science.
Even the basic Locard principle on which forensic science is built has its digital twist: Locard's Exchange Principle states that "every contact leaves a trace" (Prof. Edmond Locard, c. 1910). This holds perfectly in the digital world, which is why log analysis was one of the first branches of IT security and digital forensics to evolve. One of the key forensic principles is not to change the evidence; applied to digital forensics, this means working with read-only copies of the data, with hash signatures providing proof that the data has not been changed. Translated into practical computing, it also means the ability to process those copies in parallel, limited only by media and processing bandwidth.
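To make the hash-signature part concrete, here is a minimal Python sketch of verifying that a working copy still matches the hash recorded at acquisition time; the image path and the recorded hash are hypothetical placeholders, not the output of any particular imaging tool.

```python
import hashlib

def sha256_of_image(path, chunk_size=1024 * 1024):
    """Hash a read-only evidence image in chunks, so arbitrarily large
    files can be verified without loading them into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as image:
        for chunk in iter(lambda: image.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Hypothetical values: the hash recorded when the image was acquired.
acquired_hash = "0000000000000000000000000000000000000000000000000000000000000000"
if sha256_of_image("evidence/disk01.dd") == acquired_hash:
    print("Image verified: contents unchanged since acquisition")
else:
    print("Hash mismatch: the working copy differs from the acquired image")
```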
The core problem of digital forensics today is processing huge volumes of data. To be honest, this is a big unspoken obstacle, often overlooked and sometimes not understood by digital forensic practitioners and even vendors. Disk sizes have skyrocketed from megabytes to tens of terabytes, and the sheer volume of data in which relevant digital evidence is hidden is a huge problem. Just creating a forensic copy of a one-terabyte disk takes at least three hours, and that is before any analysis can be done. After that, the even more time-consuming process of finding and extracting digital evidence begins, and it usually takes much longer, sometimes days. This step is the analysis phase of digital forensics, and it is conceptually very close to a data-mining process.
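As a rough sanity check of the three-hour figure, assume a sustained imaging throughput of about 100 MB/s, a plausible rate for a single spinning disk behind a write blocker (the exact number is an assumption and varies with hardware):

```python
disk_size_bytes = 1 * 10**12             # one terabyte (decimal)
throughput_bytes_per_s = 100 * 10**6     # assumed sustained rate of 100 MB/s
imaging_hours = disk_size_bytes / throughput_bytes_per_s / 3600
print(f"{imaging_hours:.1f} hours")      # ~2.8 hours, before hash verification
```

Add hash verification and the real-world overhead of bad sectors and slower interfaces, and three hours is an optimistic lower bound.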
Current mainstream digital forensic tools are not capable of efficient parallelism, automation or scripting, and are limited to Microsoft Windows platforms on Intel architecture, a "general purpose PC paradigm" which is not the best choice for fast and efficient data processing.
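By contrast, even commodity scripting allows this kind of work to be parallelised. The sketch below hashes a set of evidence images across all available cores; the evidence directory and file names are hypothetical, and the hashing helper from the earlier sketch is repeated so the block stays self-contained.

```python
import hashlib
from multiprocessing import Pool
from pathlib import Path

def sha256_of_image(path, chunk_size=1024 * 1024):
    """Hash one evidence image in chunks and report which file it was."""
    digest = hashlib.sha256()
    with open(path, "rb") as image:
        for chunk in iter(lambda: image.read(chunk_size), b""):
            digest.update(chunk)
    return path, digest.hexdigest()

if __name__ == "__main__":
    # Hypothetical evidence directory; each image is processed by its own
    # worker, so throughput is bounded by I/O bandwidth, not by the tool.
    images = sorted(Path("evidence").glob("*.dd"))
    with Pool() as pool:
        for path, digest in pool.imap_unordered(sha256_of_image, images):
            print(f"{digest}  {path}")
```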
Current problems and the pace of computing development make these issues practically unsolvable without using knowledge and experience from other computing fields, especially data science. From a data point of view, we can separate digital forensics into two broad categories, classic post-mortem forensics and live forensics, depending on whether we are dealing with static data or with dynamically changing data. In both situations we have to work with raw data and transform it into meaningful digital evidence. This becomes even more significant when we are talking about incident response in modern networked systems. Each end node involved in an incident can be treated as a data source that has to be collected and analyzed, a situation where we have very different types of data, from raw binary disk and memory images to process structures, elaborate log information or local agent databases. At the moment all this data is handled separately, not as part of one picture. To address these issues efficiently, data science knowledge should be used to refine the methods and tools of digital forensics.
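As one illustration of what "one picture" could look like, here is a hedged sketch that pulls per-node exports into a single incident timeline using pandas; the file names and the shared column names (timestamp, node, event) are assumptions about how the collected data might be normalised, not an existing tool.

```python
import pandas as pd

# Hypothetical per-node exports; in practice these would come from log
# collectors, disk/memory parsers or endpoint agents in different formats.
web_logs = pd.read_csv("node_web/access_log.csv", parse_dates=["timestamp"])
agent_events = pd.read_json("node_db/agent_events.json")
agent_events["timestamp"] = pd.to_datetime(agent_events["timestamp"])

# Normalise both sources to a shared schema and merge them into a single
# timeline, instead of analysing each data source in isolation.
columns = ["timestamp", "node", "event"]
timeline = (
    pd.concat([web_logs[columns], agent_events[columns]])
      .sort_values("timestamp")
      .reset_index(drop=True)
)
print(timeline.head())
```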
11.04.2016 link to draft presentation for this discussion