Are you a data scientist?

Word Cloud of my 3000 tweets

Citizen Journalism and the iPhone – reporting trend

A History of Hacktivism Infographic

Calculating similarity scores in computers that talk to similar computers

For this part of the exercise, I look at two IP addresses and calculate similarity using Euclidean distance and Pearson correlation. I created a small dataset as a nested dictionary. I did the calculations manually, but Python’s pandas can work the numbers easily. I calculate the distance of Lisa from Kirk by isolating a pair of IP addresses and plotting their values on a graph. I do this for each combination of people and each combination of IP addresses. I even find people who are very similar and one who is not as similar. This model can help to understand clusters and to identify baseline conversations between people and visited IP addresses. Somehow it all makes sense to me.

talkers = {
    'Lisa': {'': 2.5, '': 3.5, '': 3.0, '': 3.5, '': 2.5, '': 3.0},
    'Kirk': {'': 3.0, '': 3.5, '': 1.5, '': 5.0, '': 3.0, '': 3.5},
    'Phillip': {'': 2.5, '': 3.0, '': 3.5, '': 4.0},
    'Dan': {'': 3.5, '': 3.0, '': 4.5, '': 4.0, '': 2.5},
    'James': {'': 3.0, '': 4.0, '': 2.0, '': 3.0, '': 3.0, '': 2.0},
    'Britney': {'': 3.0, '': 4.0, '': 3.0, '': 5.0, '': 3.5},
    'Toby': {'': 4.5, '': 1.0, '': 4.0},
}
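
The two similarity scores mentioned above can be sketched in plain Python. This is only a minimal illustration, not the calculation used in the post: the `sample` dictionary, its scores, and the placeholder addresses (taken from the 192.0.2.0/24 documentation range) are all made up for the example, and the function names are my own. Euclidean distance is turned into a similarity with 1 / (1 + distance), which is one common convention and an assumption here.

```python
from math import sqrt

# Hypothetical data for illustration only: placeholder IPs from the
# 192.0.2.0/24 documentation range with made-up scores.
sample = {
    "Lisa": {"192.0.2.1": 2.0, "192.0.2.2": 4.0, "192.0.2.3": 3.0},
    "Kirk": {"192.0.2.1": 3.0, "192.0.2.2": 5.0, "192.0.2.3": 4.0},
    "Toby": {"192.0.2.1": 4.0, "192.0.2.2": 1.0},
}


def shared_keys(data, a, b):
    """IP addresses that both talkers have a score for."""
    return sorted(set(data[a]) & set(data[b]))


def euclidean_similarity(data, a, b):
    """Map Euclidean distance over shared IPs to a score in (0, 1]."""
    keys = shared_keys(data, a, b)
    if not keys:
        return 0.0
    distance = sqrt(sum((data[a][k] - data[b][k]) ** 2 for k in keys))
    return 1.0 / (1.0 + distance)


def pearson_similarity(data, a, b):
    """Pearson correlation over shared IPs, between -1 and 1."""
    keys = shared_keys(data, a, b)
    n = len(keys)
    if n < 2:
        return 0.0
    xs = [data[a][k] for k in keys]
    ys = [data[b][k] for k in keys]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined when one side never varies
    return cov / sqrt(var_x * var_y)


print(euclidean_similarity(sample, "Lisa", "Kirk"))
print(pearson_similarity(sample, "Lisa", "Kirk"))
print(pearson_similarity(sample, "Lisa", "Toby"))
```

In this made-up sample, Kirk's scores are always one point above Lisa's, so the Euclidean score is well below 1 while the Pearson correlation is perfect. That difference is the usual reason for looking at both measures.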

Sneak Peek – Big Data Infosec Open Source Solution

Below is the brain dump, analog style

Below is a model of the brain dump

Below is an example interface that provides an overview, situational data, interactivity, search and drill-down capability

Trying to unlock Level 6 Achievement – Predictive Analytics

Level 6 Challenge – Predictive Modeling – Attack Simulation – War Games

The organization has created thousands of models and has a solid understanding of the business and its priorities. It is planning to use predictive analytics and statistical techniques from modeling, machine learning, data mining and game theory, which analyze current and historical facts to make predictions about future events. These reports require heavy interaction among the BI, visualization, and Infosec teams to produce real, validated results.

My journey into Data Mining

Following up on my recent blog post on Infosec and Big Data, I decided to write more about the journey into my thought process and my investigation of how the Infosec industry is going to change. The next few posts will detail some lessons learned from building a variety of algorithms for a Big Data system.

I’ve learned that data mining involves choosing between two paths: look at data to explain the past, or use data to predict the future. SIEM vendors have created a variety of algorithms to identify attacks. Most of these algorithms are easy to reproduce for simple attacks like DoS and DDoS. Detection for any attack that floods, brute-forces or breaks a threshold is easy to build. Twitter, Facebook, Netflix, Google and “put-your-social-here” have figured out how to data mine data/machine logs.

Data Mining --- Future --- Modelling
           \_ Past --- Exploration
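
To show why threshold-style detection for flooding or brute-force attacks is easy to reproduce, here is a minimal sketch. It is not any SIEM vendor's algorithm: the `events` list, the `flag_brute_force` name, and the limit of five failures are assumptions made up for illustration, and the addresses come from the 192.0.2.0/24 documentation range.

```python
from collections import Counter

# Hypothetical log records: (source IP, outcome). None of this is real data.
events = (
    [("192.0.2.10", "fail")] * 6
    + [("192.0.2.20", "fail"), ("192.0.2.20", "ok"), ("192.0.2.30", "ok")]
)


def flag_brute_force(records, max_failures):
    """Return source IPs with more than max_failures failed attempts."""
    failures = Counter(src for src, outcome in records if outcome == "fail")
    return sorted(src for src, count in failures.items() if count > max_failures)


print(flag_brute_force(events, max_failures=5))
```

A real system would count within a sliding time window rather than over the whole log, but the core is still a counter compared against a limit.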

From my own experience, I have found it trivial to create a few examples to “Data Mine the Past”. There are literally thousands of examples of how to data mine social media. Everyone and their aardvarks have used a tool to “data mine” and visualize past data. I’m finding it less trivial to create “predictive models of the future”. Below is a graphic I found that explains the connection of Data Mining to other concepts.

Making your own “predictive models of the future”

Predictive modelling is also referred to as “machine learning” or “pattern recognition”. Many approaches use a single model and give the user a single answer. These one-to-one models are formula-based. The first challenge I ran into was deciding which tool to model in: R? Python? Weka? Mahout? Next came finding a variety of data sets, cleaning them, and loading them into the tool of my choice.

NOTE: All solutions must use Hadoop. Googling “predictive model marketplace” didn’t help much.  Why isn’t there a place on the web where people can freely share predictive models?

The next step is choosing the model I will experiment with. In predictive modelling, there are four categories to choose from: 1. Classification 2. Regression 3. Clustering 4. Association Rules.

Data Mining --- Future --- Modelling --- Classification
                                     \_ Regression
                                     \_ Clustering
                                     \_ Association Rules

I have identified three predictive algorithms to start with for discovering network attacks: k-means clustering, k-nearest neighbour, and association rules. I will create a simple algorithm for each. A variety of papers can be found simply by searching Google for each type of algorithm concatenated with “network attacks”. There is quite a bit of math and theory around these techniques that I am not familiar with. Notwithstanding, I will try to explain and create trivial predictive algorithms in my next series of blog posts.
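
As a taste of what a trivial version of one of these can look like, here is a rough k-nearest neighbour sketch. It is not the code from the later posts. The two features (connections per minute, failed logins per minute), the labelled `training` samples, and the `knn_predict` name are all hypothetical and chosen only for illustration.

```python
from collections import Counter
from math import dist  # available in Python 3.8+

# Hypothetical labelled samples:
# ((connections per minute, failed logins per minute), label)
training = [
    ((5.0, 0.0), "normal"),
    ((8.0, 1.0), "normal"),
    ((6.0, 2.0), "normal"),
    ((90.0, 40.0), "attack"),
    ((120.0, 55.0), "attack"),
    ((100.0, 35.0), "attack"),
]


def knn_predict(samples, point, k=3):
    """Label a point by majority vote among its k nearest samples."""
    nearest = sorted(samples, key=lambda s: dist(s[0], point))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


print(knn_predict(training, (7.0, 1.0)))
print(knn_predict(training, (110.0, 50.0)))
```

With real traffic the features would need scaling so that one large-valued column does not dominate the distance, and k would need tuning against labelled data.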

Read the second instalment of this blog series on building your own Predictive Analytics Engine on k-means clustering, k-nearest neighbour, and association rules.