
An Elk. With scales!

This week I have mostly been fiddling with Docker and Elasticsearch/Logstash/Kibana. (Known as 'ELK').

The basics are something I've fiddled with before - elasticsearch is a NoSQL database that's built to shard and scale. Logstash is a log parsing tool, which extracts log metadata and inserts it into ... well, a variety of databases, but in this case I'm using elasticsearch.

And Kibana is a visualisation tool that - amongst other things - comes with a configuration for displaying logstash-parsed logs out of an elasticsearch back end.

I've tried this before - and it worked fine - but what I wanted to try this time is making a scalable system. And thus docker containers. If you haven't encountered them, they're ... sort of like a mini virtual machine. You create a docker image - which is essentially an application, but bundled with all its dependencies.

And from the image, you create containers - runnable instances of an application. But the key point is, each container is ... well, self contained. All the dependencies are bundled up together, which makes them particularly portable - relocate and start wherever you need/want. (Well, provided you have at least a basic docker build - the whole point is you don't actually need to install much else).

But the thing I was trying to do here is use a private docker network, and create a set of containers that would basically auto-configure - allowing you to 'spin up' extra nodes as you need to.

With the elasticsearch database this is working nicely - because you're instantiating containers off images, you need to think in terms of persistence. You can therefore create and attach a 'storage' container, that _is_ persistent - and just attach to that with your current elasticsearch image.

But the base 'discovery' mechanism is an IP unicast, which allows you to specify a set of 'discovery' nodes to find the initial cluster. It works well enough, but it does require you have a particular set of IP addresses active.

Logstash/Kibana are a bit less good at the dynamic discovery, so I'm still working on it. With Logstash, given its near-real-time nature, it shouldn't be too hard to start/stop and do node discovery as part of the startup script, but with Kibana it's a bit less easy.

So I'm thinking I might try looking at haproxy next, or some other discovery mechanism.

But otherwise, as it stands - I've got a container 'set' where it took me about 10 minutes to start up an extra 'node' in my cluster, to add storage/compute resources. (And most of that was installing the updates I needed for docker-engine to do the multi-host network).

So all good so far.

Entry tags:

perl

Splice CSV files in perl

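# Merge the CSV files named on the command line into a single CSV,
# matching up rows on a header that every file shares.
use strict;
use warnings;

use Text::CSV;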

my %count_of;
my @field_order;

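# First pass: read only the header row of each file, counting how many
# files each header appears in and remembering first-seen column order.
foreach my $file (@ARGV) {
    my $csv = Text::CSV->new( { binary => 1 } );
    open( my $input, "<", $file ) or die "Cannot open $file: $!";
    my $header_row = $csv->getline($input);
    foreach my $header (@$header_row) {
        if ( not $count_of{$header} ) {
            push( @field_order, $header );
        }
        $count_of{$header}++;
    }
    close($input);
}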

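# Report on STDERR so the diagnostics don't end up inside the CSV on STDOUT.
print STDERR "Common headers:\n";

# Filter @field_order rather than keys %count_of, so the order is stable.
my @common_headers = grep { $count_of{$_} >= @ARGV } @field_order;
print STDERR join( "\n", @common_headers ), "\n";

my %lookup_row;

# Use the first header shared by every file as the key, if there is one.
my $key_field;
if (@common_headers) { $key_field = $common_headers[0] }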

# Second pass: read every row of every file, indexed by key and then by file.
foreach my $file (@ARGV) {
    my $csv = Text::CSV->new( { binary => 1 } );
    open( my $input, "<", $file ) or die "Cannot open $file: $!";
    my @headers = @{ $csv->getline($input) };
    $csv->column_names(@headers);
    while ( my $row_hr = $csv->getline_hr($input) ) {

        # With no shared header, fall back to matching rows by line number.
        my $key = $.;
        if ($key_field) {
            $key = $row_hr->{$key_field};
        }
        $lookup_row{$key}{$file} = $row_hr;
    }
    close($input);
}

my $csv_out = Text::CSV->new( { binary => 1 } );
my $header_row = \@field_order;
$csv_out->print( \*STDOUT, $header_row );
print "\n";

foreach my $key ( sort keys %lookup_row ) {
    my %combined_row;
    foreach my $file ( sort keys %{ $lookup_row{$key} } ) {
        foreach my $header (@field_order) {
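            # Test for defined and non-empty, so a literal 0 is not dropped.
            if (    defined $lookup_row{$key}{$file}{$header}
                and length $lookup_row{$key}{$file}{$header} )
            {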
                if (   not defined $combined_row{$header}
                    or not $combined_row{$header} eq
                    $lookup_row{$key}{$file}{$header} )
                {
                    $combined_row{$header}
                        .= $lookup_row{$key}{$file}{$header};
                }
            }
        }
    }
    my @row = @combined_row{@field_order};
    $csv_out->print( \*STDOUT, \@row );
    print "\n";
}

A Bad Person

There have been instances recently of people saying or doing something inappropriate, and there being an associated furore over it.

A scientist making sexist comments about 'distractingly sexy' women in the lab.

A guy wearing a 'pin ups' T-shirt when talking about a space mission.

Or the whole 'sad puppies' thing around the Hugo awards.

Even "gamer gate".

The problem is in these scenarios, that there seems to be an urge to categorize people as either 'good people' or 'bad people'. And then there's a massive debate.

There is just no such thing as an unambiguously good - or bad - person. It's never so simple. There's no ethical calculus that lets you be a saint for 40 years, and then get a free pass on murdering a baby or two.

Nor do you -ever- get to 'cancel out' past mistakes. You can seek redemption, but the only way you ever get it is via forgiveness, not fixing the past.

So it's actually quite harmful to apply this sort of abstraction. We saw this in the Jimmy Savile affair (and many many other examples). People who couldn't believe he was doing what he was doing, because of all the good things he did.

And the truth is - he did both. He _did_ do a lot of good, and raise a lot of awareness and supported charities. We're fools if we dismiss that in light of the subsequent revelations.

But likewise - that doesn't excuse - or prevent from happening - the _other_ things he (allegedly?) did.

And the same is true of pretty much everyone. Everyone has a price. Everyone has pressure points. Everyone has weaknesses. Everyone has prejudices. Everyone makes mistakes.

And in some cases - these prejudices, mistakes and weaknesses lead to causing a harm that can never be repaired. And you just have to live with that. That doesn't mean you cannot do better, or indeed that you become irredeemable. It doesn't make you unforgivable.

Think how horrible _that_ would be. One mistake, you're now a 'bad person' and that is that.

A lot of people misunderstand what it is to forgive. It isn't about letting something pass, and saying 'never mind, it doesn't matter'. If it didn't matter, it wouldn't need to be forgiven. It's about letting go of _your_ pain, anger, hate or fear. Acknowledging a harm - understanding that it hurt and always will - but letting go of its ability to control your future.

Continuing to hate someone is ultimately very poisonous. It taints your world view massively.

That doesn't mean we should let "wrong" pass - not by any means. Challenge it whenever you can, especially when it's at a point that by doing so, you can change the course of a thing. But merely accept that every person in the world is a complex bundle of ambiguities, and picking fights and censure rarely changes anyone's opinion.

And you never get to 'fix' the past.

This started as a comment on a facebook post, but turned into a bit of a rant

Daily Checks: You're doing it wrong

One of my pet peeves with IT (there's many, this is just one) is the notion of 'daily checks'.

Some places have a daily checklist, that's a list of tasks they have some IT person look at, each day, to make a note that everything is OK.
This is based on a fundamentally flawed assumption - that somehow a human is better at a routine task than a computer.

This is just plain wrong. Yes, there are some things that a person will be able to spot that a computer won't. But these are not things that go on a daily check list. They're the things you see when you roll up your sleeves and do an end-to-end diagnosis.

Otherwise... a computer can do a 'daily check' much more frequently than a person can. It can do it all day, every day. And it can notify you when there's a problem. By doing so, you don't get the 'road blindness' effect - people are bad at paying attention to persistent states, they're much better at picking out anomalies.

If the light is always yellow on your system, then you won't notice when a _new_ 'yellow' alert shows up.

So really - if your 'daily check' is any more involved than 'check your email' or perhaps 'open your monitoring portal' - then you're doing it wrong. Make your computers keep watch on each other, because that way you'll know what's wrong, when it went wrong and you won't have to wait up to 24 hours before you spot the problem.

Which let's face it - if anything is significantly wrong, your phone will already be ringing anyway.
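
To make that a bit more concrete, here's a minimal sketch of the sort of thing I mean. It's an illustration rather than a real monitoring system: the check names, the 90% threshold and the use of `df -P` are all assumptions of mine, and in real life you'd run this from cron (or better, use a proper monitoring tool) and send the alerts somewhere useful. The important bit is that it says nothing at all when everything is fine.

```perl
use strict;
use warnings;
use File::Spec;

# Run every check and collect the failures. A check is a code reference
# that returns nothing when healthy, or a list of problem descriptions.
sub run_checks {
    my ($checks) = @_;
    my @alerts;
    foreach my $name ( sort keys %$checks ) {

        # eval so a check that dies is reported instead of killing the run.
        my @problems = eval { $checks->{$name}->() };
        if ($@) {
            ( my $error = $@ ) =~ s/\s+\z//;
            push @problems, "check died: $error";
        }
        push @alerts, map {"$name: $_"} @problems;
    }
    return @alerts;
}

# Example checks only. The names and thresholds are made up for illustration.
my %checks = (
    'temp directory' => sub {
        my $dir = File::Spec->tmpdir;
        return () if -w $dir;
        return ("$dir is not writable");
    },
    'root filesystem space' => sub {
        my $limit = 90;    # percent used; an arbitrary example threshold
        my @df = `df -P / 2>/dev/null`;
        return ('could not run df') unless @df >= 2;
        my ($used) = $df[-1] =~ /(\d+)%/;
        return ('could not parse df output') unless defined $used;
        return () if $used <= $limit;
        return ( sprintf '%d%% used, limit is %d%%', $used, $limit );
    },
);

# Stay quiet when healthy: only an anomaly should reach a human.
my @alerts = run_checks( \%checks );
print "ALERT $_\n" foreach @alerts;
```

Adding a new 'daily check' is then just another entry in the hash, and the computer gets to do it every five minutes instead of a person doing it once a day.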

Why requiring Real Names online is evil

I'm sure many of you will have noticed - several accounts have been disabled by Facebook, thanks to their policy that 'thou shalt use real names'.

They're under some misguided concept that anonymity leads to trolling, or maybe it's just marketing.
Whatever. It's extremely misguided.

I've used my real name on Facebook since the start. It's not particularly bothered me. However I'm also quite well aware that as a straight, white, middle class citizen of a first world country... the world is actually pretty well geared up to my convenience.

There are quite a few reasons why someone might not want to use their real name - but oddly, the vast majority of them _don't apply_ to the straight, white, middle class citizens of the world such as say, Mark Zuckerberg or the majority of employees of Facebook. http://www.usatoday.com/story/tech/2014/06/25/facebook-diversity/11369019/

Now, leaving aside _malicious_ reasons to 'fake up' a facebook persona (I can think of a few. Fraud, stalking, anonymous trolling).

There's also some really quite serious reasons why removing anonymity is an evil thing to do:
- Victims of domestic abuse. Children might want to avoid being tracked down by an abusive parent. Partners might want to avoid being tracked down by an abusive ex.
- Victims of crime in general - as a straight, white, middle class male you will likely never have to worry about being targeted by a rapist.
- People who work in particular professions - who are seriously disadvantaged when their personal and professional lives collide. Police officers, teachers, social workers... are all much more at risk of online abuse as a result of what they do.
- LGBT individuals, who risk harassment or abuse - let's not forget, there are some countries that treat homosexuality as a crime.
- Political activists and dissidents - How much do you trust your government anyway? Do you think people who live in say, Syria, or China feel the same? Do you really think it's 'fair' to put these people at risk?
- People in responsible positions, such as at banks, who are at risk of being targeted by organised crime.

One of the things I particularly remember is a person I met who used to work in the prison service. One of the things he had hammered into him on his first day: DISCLOSE NOTHING. Because prison officers are in close proximity to some serious criminals for extended periods. Some of whom were clever enough and sociopathic enough to put together a profile on 'the screws'. And if they ever completed their picture, at some point a nice man would show up outside the officer's house, and ... would probably offer to do them a favour. But would make clear that refusal wasn't an option, and things could get extremely nasty if they didn't want to co-operate.

Just by disclosing their favourite pub and which footie team they supported... they'd been tracked down, and their family put at risk. Coerced into corruption.

At which point they're basically screwed. Quitting your job might remove your risk of being corrupted, but there really is no guarantee that a criminal group won't retaliate. Not everyone can afford to provide their own personal witness protection scheme. Until you've experienced what a systematic campaign of petty harassment and vandalism can feel like - you really don't appreciate just how horrifically destructive it might be.

But the thing is - Google started requiring Real Names when they started G+.
They've since dropped it, presumably because they realised that 'Don't Be Evil' didn't really include putting people at risk simply because they've never had to worry about being victimised.

http://www.zdnet.com/google-reverses-real-names-policy-7000031642/

And to top it all - a firstname/lastname is really a very westernised view of the world. Not every country works like that.

So seriously Mr. Zuckerberg and Facebook. Think hard about the people who'll be burned by this policy.


Serialising objects in Perl, using Storable to pass them around threads

You may have found Object Oriented Perl useful - it's not a tool for every job, but if you've a complicated data model, or a data driven process, it's invaluable. (Not to mention code encapsulation - but that doesn't actually seem to come up as much).

You may have also found being able to thread and queue useful. (Perl threading and queues)

However what you'll also have probably found is that multithreading objects is a significant sort of nuisance. Objects are jumped up hashes, and there's some quite significant annoyances with sharing hashes between threads.

However, another module which I find very useful in this context is Storable (CPAN).
What Storable does - essentially - is allow you to easily store and retrieve data to the local filesystem. It's geared up to hashes particularly:

use Storable;
store \%table, 'file';
my $hashref = retrieve('file');

This is quite a handy way for handling 'saved state' in your Perl code. (Less useful for config files, because the 'stored' file is binary formatted).

However what Storable _also_ supports is objects - which as you'll recall from the previous blog post are basically hashes with some extra bells and whistles. Better yet, there are two other functions, freeze and thaw, that allow Storable to store to memory.

use Storable qw ( freeze thaw );
my $packed_table = freeze ( \%table );
my $hashref = thaw ( $packed_table );

This also works very nicely with objects, which in turn means you can then 'pass' an object around a set of threads using the Thread::Queue module.

use strict;
use warnings;
use threads;
use Thread::Queue;
use Storable qw ( freeze thaw );
use MyObject;    # placeholder for your own class

my $work_q = Thread::Queue -> new();

sub worker_thread {
  # dequeue blocks until an item arrives, and returns undef once the
  # queue has been ended and emptied, which finishes the loop.
  while ( my $packed_item = $work_q -> dequeue )  {
    my $object = thaw ( $packed_item );
    $object -> run_some_methods();
    $object -> set_status ( "processed" );
    #maybe return $object via 'freeze' and a queue?
  }
}

my $thr = threads -> create ( \&worker_thread );
my $newobject = MyObject -> new ( "some_parameters" );
$work_q -> enqueue ( freeze ( $newobject ) );
$work_q -> end();    # needs Thread::Queue 3.01 or later
$thr -> join();

Because you're passing the object around within the queue, you're effectively cloning it between threads. So bear in mind that you may need to freeze it and 'return' it somehow once you've done something to its internal state. But it does mean you can do this asynchronously without needing to arbitrate locking or shared memory. You may also find it useful to be able to 'store' and 'retrieve' an object - this works as you might expect. (Although I daresay you might need to be careful about availability of module versions vs. defined attributes if you're retrieving a stored object)
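
To show the 'cloning' point, here's a small sketch that leaves the threads out entirely, so it'll run on any Perl. The Job class is something I've made up purely for the example, and process_packed is a stand-in for the body of a worker thread: in the threaded version you'd enqueue its return value onto a second 'results' queue rather than returning it.

```perl
use strict;
use warnings;
use Storable qw( freeze thaw );

# A throwaway class, for illustration only (not from any real module).
{
    package Job;

    sub new {
        my ( $class, %args ) = @_;
        return bless { name => $args{name}, status => 'new' }, $class;
    }

    sub set_status {
        my ( $self, $status ) = @_;
        $self->{status} = $status;
        return $self;
    }

    sub status { my ($self) = @_; return $self->{status} }
}

# Stand-in for what a worker does with a dequeued item: thaw it, change
# it, and freeze it again so the new state can be sent back.
sub process_packed {
    my ($packed_item) = @_;
    my $object = thaw($packed_item);
    $object->set_status('processed');
    return freeze($object);
}

my $original = Job->new( name => 'example' );
my $returned = thaw( process_packed( freeze($original) ) );

# The worker only ever saw a copy, so the original is untouched.
printf "original: %s\n", $original->status;
printf "returned: %s (%s)\n", $returned->status, ref $returned;
```

The thawed copy comes back still blessed into its class, so you can call methods on it straight away. But any change the worker made only exists in what it sends back.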

Kerberos - the gatekeeper.

You may not have heard of Kerberos. But there's a pretty good chance that you've used it, if you've used Windows in a place of work in the last ... 10 years or so.

It's a method of single sign on, designed at MIT about 20 years ago. It's really quite clever - so much so, that no one's managed to beat it in that time. It was intended to be a way of authenticating users in an untrusted network, for Unix.
Ironically - it was Microsoft that turned it 'mainstream'. Active Directory is - basically - a combination of Kerberos and LDAP. (Which are the two key elements of a Kerberos authentication domain).

The reason it's quite clever? Well, prior to its invention, Unix (and Windows) basically had an account per server. It had extended a little into 'shared' accounts with things like NIS (formerly known as YP). (Which is basically a 'shared' account list, that each server can authenticate against if it wishes).

But you still had to type a password in, each server you logged in to. You could set up some sort of 'override' (rsh 'authorized hosts' and later ssh public/private key pairs) but it didn't handle network level authentication.

What Kerberos does, is allow you to 'declare' your identity to an authentication server (the Key Distribution Center, or KDC - which in Windows is an Active Directory domain controller). It uses encryption to handle the authentication mechanism - which is another clever innovation, because you then don't have to send your password in the clear.

You encrypt - locally - a message. You send it to the DC. Which then - because it 'knows' your password - can decrypt the message. And send you one back, encrypted the same way. To prevent shenanigans, it requires you to encrypt the time, to make replay attacks harder. (Which is why AD/Kerberos breaks when your clocks are >5m out of sync).

It issues a 'ticket granting ticket' (TGT). This is a 'backstage pass', and - provided it's still valid - can be used to request access to other services in the network. You request access to another service by 'asking' for a ticket for it - the KDC then (because it knows the 'machine account' password for the server) sends _you_ a ticket, containing an (encrypted) authorisation. The server you're trying to access can decrypt it (using its machine account credentials).

And because stuff is handed around encrypted (Kerberos doesn't explicitly specify encryption mechanisms) you get a way of proving you are who you say, and that your remote server is also the one you expected to be talking to - the message can only be decrypted by its intended recipient.
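
The 'encrypt the time' trick is easier to see in code. This is emphatically NOT real Kerberos: I'm using an HMAC from Digest::SHA as a stand-in for proper encryption, the function names are my own invention, and the 'key' is just a string rather than something properly derived from a password. It only illustrates the idea that the server can check you know the shared secret, and reject anything with a timestamp too far from its own clock, without the secret ever crossing the network.

```perl
use strict;
use warnings;
use Digest::SHA qw( hmac_sha256_hex );

my $MAX_SKEW = 300;    # seconds: the five minute tolerance mentioned above

# Client side: stamp the time and prove knowledge of the shared key.
sub make_authenticator {
    my ( $user, $key, $now ) = @_;
    my $mac = hmac_sha256_hex( join( ':', $user, $now ), $key );
    return { user => $user, time => $now, mac => $mac };
}

# Server side: recompute the MAC from its own copy of the key, then
# reject anything whose timestamp is too far from the server's clock.
sub check_authenticator {
    my ( $auth, $key, $server_now ) = @_;
    my $expected = hmac_sha256_hex( join( ':', $auth->{user}, $auth->{time} ), $key );
    return 'bad key' unless $auth->{mac} eq $expected;
    return 'clock skew too great'
        if abs( $server_now - $auth->{time} ) > $MAX_SKEW;
    return 'ok';
}

my $key  = 'stand-in-for-a-password-derived-key';    # illustration only
my $auth = make_authenticator( 'alice', $key, 1_000_000 );

print check_authenticator( $auth, $key,        1_000_060 ), "\n";    # ok
print check_authenticator( $auth, $key,        1_000_400 ), "\n";    # clock skew too great
print check_authenticator( $auth, 'wrong-key', 1_000_060 ), "\n";    # bad key
```

The times are passed in as plain numbers rather than calling time(), just so the example gives the same answer every time you run it.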

It's actually pretty cool - Single Sign on is something that remains a challenge to implement (securely/safely). And Kerberos is about the only game in town.

"Sorry we missed you"

Is there anything quite as annoying as waiting for a delivery that doesn't arrive?
I think I have one - the delivery that arrives, but the driver doesn't even bother to knock on your door.

Interlink Express - marvelous chaps - were due to deliver my parcel today.
They even gave me a very specific 1 hour time window, and a button to reschedule, via a text message.

I was due to be working from home, so I didn't do any rescheduling. So at about 12:15, I heard my letter box rattle - that's about when the post arrives, so I wandered over to see what had arrived. (Probably more bills :().
But no, it was a 'sorry we missed you' card from Interlink express, explaining how - because I wasn't in to sign for this consignment - they had taken it away again.

Now, I was prepared for this eventuality - after all there's no guarantee I wouldn't be on the toilet when my parcel arrived, or something.

However - I'm absolutely, 100% certain that the guy didn't even knock. Because I heard the letterbox rattle, as the card came through it, and - just about - heard the sound of a van leaving as I got to the doormat.

I'm bemused. This delivery has cost me ... oh, £6.99 I think it was?
But they've clearly gone to the effort to drive to my house, find the right door, and put a card through it. Is there some mystery reason why the last 30s of effort might be just a bit too much?
Do delivery drivers have a time window 'per delivery' which they have to be careful not to overrun? So they're in danger of getting to 14m30 on this delivery, and simply don't have time to _actually_ unload it, because they only have 30s left?

I'm really not sure. All I know, is I'm immensely irked by the frankly shoddy customer service this represents. I was at home, waiting for the delivery - and they've treated it as a game of 'knock down ginger'. (only without the knocking).

PS - I hope Interlink Express customer services have set up google alerts, and therefore will see this post. Hi there. I'm a disappointed customer.

A morning of hedgehogs

The Vale Wildlife hospital again put out a call for volunteers on Facebook - at this time of year in particular, they're really rather busy, and it's the holiday season. So I sort of got volunteered - their hedgehogs needed a bit more attention.

With a day starting 'before 8am' (on a saturday *shudder*) I made my way to the hospital. Made a brief hello, and then started to give the guy who actually knew what he was doing a hand. The work at hand was the hedgehogs. This was a ... room? shed? well, place with a number of hedgehog enclosures. They were due their weekly weighing, feeding and cleaning. Mostly there were recuperating patients, who'd been admitted for a variety of reasons - usually 'got stuck somewhere' or otherwise needed rescuing - one had fallen in a swimming pool, for example, and had needed rescuing.

A few were pregnant or had recently given birth - a few enclosures had '5-ish' hoglets, which look a lot like miniature hedgehogs, but their spines haven't actually hardened and gone prickly, and they're just a bit smaller and more wobbly. Working down the line of enclosures involved scooping the hedgehog(s) out with a pair of leather gloves. I was on the 'larger end' of their volunteers - most, particularly on this day, seem to be female and college/university age - and so the gloves in the room were just too small.

(This is the 'before' photo - these little ones are in the process of being moved, one at a time, to a new, clean enclosure, after being weighed).

But all in all, an interesting sort of a day. I'm unfortunately not going to be in a position to do a regular shift at the hospital, but think I shall keep an eye out for when they're short handed again. (I'm on the list as being able to fetch rescuees on my way home from work, which is a little less time intensive.)

An odd sort of Sunday afternoon: An expedition with a bat

A couple of weekends ago, on a bit of a quiet afternoon, we took a bat in a box for a ride in the car. A had spotted a post on a facebook group from the Vale Wildlife hospital, that ... they'd had a report of a bat that needed rescuing, and they were really busy. (They basically always are).
With my vast experience of bats (i.e. none at all) and being at a vague loose end, we volunteered to go be bat-taxi. Slightly concerned that - as a protected species - handling them was a 'no-no', we were re-assured that it was in a box.

Off we went to a hotel in south Gloucestershire, to fetch said bat. It's a rather pretty place, and I trundled into reception to declare 'I'm here about the bat'. I was treated to a stroll "below stairs" - an odd split, as the luxurious plush carpets and rich furnishings gave way to white washed walls and slightly battered lino. And there was the bat - in its 'box'.

... which when I thought of a box, I thought 'with a lid' but it turns out their definition was more like a flimsy cardboard tray. (I can only assume they hadn't worked out that y'know, bats can fly).

When asked 'so what species was it' I had to ad-lib slightly, and point out that bats really weren't my field. (Implying perhaps I had any clue whatsoever about ... well, any form of wildlife at all). (At the hospital, they were happy to tell me that it was a pipistrelle).

So upon getting back to the car, there had to be a hasty bit of tissue box vandalism, just to ensure the bat wouldn't, in fact, be 'exploring' the bat-taxi. Thinking about it - it'd probably be less of a problem than a bird, because at least a bat can 'see' the windscreen. But even so.

A had it on her lap in the front seat, holding the 'tissue box' type lid, and off we went. A few miles down the road, we realised that the combination of 'dark' + 'airconditioning' might be ... well, a bit like 'night time' and the bat was starting to wake up and wriggle. But with the wriggling, the cunning plan was to press on, and hope it didn't get out.

A little further on, the bat had found the edge of the box, and was trying to squeeze out, and A could feel it tickling her hand as it wriggled.

We got to the hospital without further incident - only to find that the wriggling had been the bat finding somewhere cozy and warm to hide - Almost in the palm of A's hand.
We weren't entirely sure if the aforementioned 'no handling' law really applied to bats coming to sit on you, but thought they might be tolerant of the fact that it was for the purposes of getting it to the wildlife hospital.

Said pipistrelle was admitted overnight, fed and watered, and was to be examined by a specialist in the morning - with an aim of recovery, then release - there weren't any signs it was any more 'ill' than 'got lost and stuck in a hotel room'.

Book Review - A Madness of Angels, Kate Griffin

This is the first in ... sort of two series of books. (as in, there's two sets in the same world, but following different lead characters, mostly).

It's an 'urban magic' book, set in London. Having typed that, you may be thinking of several other really good examples of 'urban fantasy' - such as Dresden Files, or Alex Verus, or perhaps Rivers of London. It's a little like those, but perhaps more the latter.

I have to say, I found it somewhat hard going at first, because - well, because I was expecting more of the same - a Wizard, who happens to live in a city type of story. And it's not like that at all. It's more sorcery and shamanism than wizardry. By which I mean - the 'magic' of the city is bound to the patterns of life _in_ the city, so some of the time, the story telling seems almost dreamlike.

It also starts in a bit of a rush and confusion - which is difficult at first, but gets easier. Bear with it - the protagonist has been out of circulation for a while (which helps with introducing you to the shape of the world).

It's also incredibly evocative - the underpinning principle is that magic is life, and the power of a modern sorcerer (or shaman, or warlock, or wizard) is innately tied to the patterns of life within the city. I like that it's set in London - which is a city with an awful lot of history to it. And that history is part of the magic. So you have the 'powers of the city' - the bag lady, the beggar king, the neon court, the graffiti artists. You have the magic of pigeons (which see everything) and foxes. You have the power of a warding, based on the terms and conditions of the London underground, and graffiti paint being (potentially) magic sigils.

It's a different sort of thing, because it is innately tied to the magic of a city, and I think it's really marvelous as a result (if slightly harder going).

Pr0n!

In the news today, is some headline grabbing nonsense about protecting children from the evils of porn.
I'd like to suggest that this is just nonsensical - almost all the sensational nonsense is generally about cheap titillation and scaremongering.

What goes on between two (legally and informed) consenting individuals is none of the business of state, or indeed anyone else.

There's various types of extreme porn that are illegal - and personally, I think that distracts from the important point. Because at the end of the day, no matter how extreme, the depiction, is only a picture. A depressing or disturbing one maybe, but still - just a picture.

The _problem_ is two separate things:
- Harm done to the subject. Especially when consent cannot be given (e.g. because of being too young). If abuse is committed, then that's a crime in and of itself.
- Harm done to the 'viewer'. It's hard to say for sure what effect repeated exposure to disturbing content actually has, but there's suggestions of links between extreme porn and future abuse. Correlation doesn't imply causation though - there's nothing to say that that link hasn't reduced the future abuse, rather than increased it.

But in neither case do we really do much good by trying to censor the internet. The WHOLE POINT of the internet is it's uncontrolled and uncontrollable. Trying to control search terms is on a par with trying to ban drugs by current street name - an exercise in futility, because as soon as one gets banned, there'll be a new one in use.

I would suggest instead that disturbing and damaging porn is a mental health problem - not a crime (in and of itself - obviously if people are harmed, then that's a crime in its own right). You can't fully protect children from exposure to disturbing concepts, and assuming that the magic of the internet is going to... well, it is a deeply flawed assumption. (There's not many parents who would call themselves more tech savvy than their teenage children either).

It's far better to engage and understand - from all directions. Don't censor or censure, but encourage openness. And yes, that does mean that some people with some quite disturbing fantasies will come to light. But far better that, than the problem being suppressed until it's far too late for some innocent victim.

The internet is a real power in our society today - ideas and concepts can be moved around like never before. This means all sorts of good things happen as a result. It also means all sorts of bad things can too - there's a lot of nastiness buried in the human psyche, and that'll never go away. But you can shine a light on it, and reveal it for what it is.

The large filesystem problem

I'm currently musing on a difficult problem. Given a large storage estate, which contains some large filesystems, what is an efficient way to process 'the whole lot'?
As an illustrative case - take virus scanning. It's desirable to periodically scan 'everything'. There's other scenarios such as backups, accounting and probably a few others.
But it's led me to consider it - given something of the order of a petabyte, distributed over a billion or so files, what is an efficient way to do it?
Again - take the same illustrative case. A virus scanner, that can process 100k files per hour. At that rate, you're looking at 10,000 hours - or a little over a year. Even if you could keep a system doing that all the time, you're still faced with - potentially - having to keep track of how far you got, on something that's changing as you go.
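
Just to put numbers on that, here's the arithmetic as a few lines of Perl. The figures are the ones above. The scanner counts, and the idea that throughput scales perfectly linearly as you add scanners, are optimistic assumptions of mine, since in practice you'll hit contention somewhere.

```perl
use strict;
use warnings;

# Hours needed to touch every file once, assuming throughput scales
# linearly with the number of scanners (an optimistic assumption).
sub scan_hours {
    my ( $total_files, $files_per_hour, $scanners ) = @_;
    return $total_files / ( $files_per_hour * $scanners );
}

my $total_files    = 1_000_000_000;    # "a billion or so files"
my $files_per_hour = 100_000;          # one scanner, as in the text

foreach my $scanners ( 1, 10, 100 ) {
    my $hours = scan_hours( $total_files, $files_per_hour, $scanners );
    printf "%3d scanner(s): %6d hours, about %.1f days\n",
        $scanners, $hours, $hours / 24;
}
```

So one scanner is 10,000 hours (about 417 days), and even at ten-wide you're still looking at well over a month per pass.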

So with that in mind, I'm thinking about ways to scale the problem. The good bit is - as you end up with substantial numbers, you also have a lot of infrastructure to make use of - you can't physically get to a petabyte, without a lot of spindles and controllers. And that usually means array level readahead caching too.
Which means optimally, you'll 'go wide' - try and make use of every spindle and every controller at once. And also, ideally doing it whilst maximising readahead efficiency, and minimising contention. (And of course, given the timescale you almost certainly have to 'fit in' with a real production workload, including backups).

The problem can be simplified to - given a really large directory structure, what's an efficient way to traverse it and break it down into 'bite size pieces'. Again, following on the virus checking example - maybe you want to break down into '100k file' pieces, because then each chunk is about an hour of processing, which can be queued and distributed. And then you will scale this, by taking each filesystem as a standalone object, to be traversed and subdivided.

You may also end up having to do something similar in future too - again, virus checking - you probably want to repeat the process, but you can then apply some sort of incremental checking (e.g. check file modification times perhaps, although that may be unwise unless you can verify that the file actually is unchanged).

The other part of the problem is - well, you can't easily maintain a long list of 'every file' - for starters, you already essentially do that - it's called 'your filesystem'. And otherwise you're looking at a billion record database, which is also ... well, a different scale of problem.

So I've started reading about Belief Propagation https://en.wikipedia.org/wiki/Belief_propagation - but what I'm thinking of in terms of approach is to - essentially - use checkpoints to subdivide a filesystem. You use a recursive traversal (e.g. similar to Unix's 'find') but you work on a 'start' and 'end' checkpoint. Skip everything until start, process and batch everything up until 'end' checkpoint.
Ideally, you'll measure the distance between your checkpoints as you go, and 'mark off' each checkpoint each time you complete a batch.
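A minimal sketch of the 'skip until start, batch until end' idea, assuming a checkpoint is simply the relative path of a file and that directories are always walked in sorted name order (so the traversal order is repeatable). The names path_cmp, walk and batch_between are mine, and the tiny temporary directory tree is only there to make it runnable.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw( tempdir );

# Compare two relative paths component by component. This matches the
# order walk() visits files in (names sorted within each directory).
sub path_cmp {
    my ( $left, $right ) = @_;
    my @l = split m{/}, $left;
    my @r = split m{/}, $right;
    while ( @l && @r ) {
        my $order = shift(@l) cmp shift(@r);
        return $order if $order;
    }
    return @l <=> @r;
}

# Depth-first walk in sorted order. Calls $visit with each file's path
# relative to $root, and stops early if $visit returns false.
sub walk {
    my ( $root, $rel, $visit ) = @_;
    my $dir = length $rel ? "$root/$rel" : $root;
    opendir( my $dh, $dir ) or return 1;    # unreadable: skip it, carry on
    my @names = sort( grep { $_ ne '.' && $_ ne '..' } readdir $dh );
    closedir $dh;
    for my $name (@names) {
        my $path = length $rel ? "$rel/$name" : $name;
        if ( -d "$root/$path" && !-l "$root/$path" ) {
            walk( $root, $path, $visit ) or return 0;
        }
        elsif ( -f "$root/$path" ) {
            $visit->($path) or return 0;
        }
    }
    return 1;
}

# Collect files after $start (exclusive) up to $end (inclusive).
# An undefined checkpoint means 'the very beginning' or 'the very end'.
sub batch_between {
    my ( $root, $start, $end ) = @_;
    my @batch;
    walk( $root, '', sub {
        my ($path) = @_;
        return 1 if defined $start && path_cmp( $path, $start ) <= 0;    # still skipping
        return 0 if defined $end   && path_cmp( $path, $end ) > 0;       # past the end
        push @batch, $path;
        return 1;
    } );
    return @batch;
}

# Throwaway demo tree: a/1 a/2 a/3 b/1 b/2 top
my $root = tempdir( CLEANUP => 1 );
mkdir "$root/$_" for qw( a b );
for my $file (qw( a/1 a/2 a/3 b/1 b/2 top )) {
    open( my $fh, '>', "$root/$file" ) or die "Cannot create $file: $!";
    close $fh;
}

my @batch = batch_between( $root, 'a/2', 'b/1' );
print join( ", ", @batch ), "\n";
```

Comparing paths, rather than waiting to 'see' the start file, means a checkpoint still works if the file it names has since been deleted. This sketch still visits everything before the start checkpoint though - a real version would want to prune whole directories that sort before it.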

For the sake of parallelising and distributing though - I'm thinking that given you _can_ tell the number of inodes allocated on a filesystem (which approximates the number of files), you can then tell how many 'checkpoints' you would need within that filesystem. At which point you start traversing downwards, in depth order, until you get a number of directories that is in the right order of magnitude - and use each of those as your first set of checkpoints. As you run, redistribute the checkpoints: for a batch size of n, take a new checkpoint every n/2 files, and if the distance between two neighbouring checkpoints is less than n/2, simply delete one of them. That should mean you get batches between n/2 and n in size. There'll be some drift between iterations, but as long as it's within the same order of magnitude, that doesn't matter overly.
Start 'finding', accumulate 'a few' batches, and then leave them to be processed, moving on to a different 'part' of the storage system, to do the same. (Separate server, location, whatever). You don't want your search to get too far ahead of your processing - you're probably looking at memory buffering your batches, and having too much buffered is a waste.
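The checkpoint redistribution is easier to see with numbers. This is one possible reading of the rule rather than a finished design: files are modelled as positions 1 to N in traversal order, an old checkpoint is dropped if fewer than n/2 files have gone by since the last one that was kept, and a new one is added whenever n files go by without hitting one. The sub name rebalance and the sample figures are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# $total_files : how many files the traversal will see
# $n           : target batch size
# @old         : positions (file counts) of the existing checkpoints
sub rebalance {
    my ( $total_files, $n, @old ) = @_;
    my %is_old = map { $_ => 1 } @old;
    my @new;
    my $since_last = 0;    # files seen since the last checkpoint we kept

    for my $position ( 1 .. $total_files ) {
        $since_last++;
        if ( $is_old{$position} ) {
            next if $since_last < $n / 2;    # too close to the last one: drop it
            push @new, $position;
            $since_last = 0;
        }
        elsif ( $since_last >= $n ) {
            push @new, $position;            # gap got too big: add one
            $since_last = 0;
        }
    }
    return @new;
}

# 100 files, batch size 20, and a lopsided set of old checkpoints.
my @new = rebalance( 100, 20, 5, 8, 30, 90 );
print join( ", ", @new ), "\n";
```

With those figures the checkpoints at 5 and 8 get dropped, 30 and 90 survive, and new ones appear at 20, 50 and 70 - so every gap ends up between 10 and 20 files. In practice the counting would happen inside the traversal itself, not over a list of numbers.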

But it'll always be a bit of a challenge - fundamentally, there's only so fast you can move substantial volumes of data.

Object Oriented Perl

I started taking a look at object oriented perl the other day. Mostly because I was deconstructing something that didn't work quite right. Anyone a little bit familiar with Perl will realise that they've probably already seen it, because in Perl, OO is driven by hashes, references and packages.

(Here's a hint - any time you've used '->' that's probably calling an object, and - because OO lets you encapsulate - there's a lot of that in imported modules).

The basics are - an object is a package, with an internal hash. And ... that's about it.
There's one 'new thing' that you may not have seen - 'bless'. Which is perl's way of giving a generic reference a class. Because they're applicable to objects, you'll see the subroutines within the package referred to as 'methods'.

You 'use' a method, with '->'. This is exactly the same as just running the subroutine, but perl passes the object reference (that you 'blessed') into the subroutine as the first argument.
$object -> get_value ( "fish" );
Is equivalent to:
&Package::get_value ( $object, "fish" );

You then rely on the 'get_value' sub to 'know what to do' with $object. (Which is one of the underlying principles of OO - you ask it to do something, you don't deal with how it accomplishes it).

By convention too, packages should include a method 'new' - a constructor that sets up the blessed reference, and does any other initialisation that's necessary. (Doesn't have to be called this, but it's usually a good idea). Similarly - 'internal' subroutines are prefixed with _ to indicate they shouldn't be called directly. Unlike stricter languages, perl doesn't enforce privacy within objects. You _can_ diddle with attributes and internal methods, but it's asking for future pain, so don't do it.

If you create a sub called 'DESTROY' then this is called when an object would be deleted (usually due to going out of scope, or on program termination).

That's about it, really. Quite a bare bones implementation. If you want more 'OO' style features, there's a module called Moose, which implements a lot of more advanced features.

Here's some sample, illustrative code:

MyObject.pm:
#!/usr/bin/perl

use strict;
use warnings;

package MyObject;

sub new
{
  my ( $class ) = @_;
  print "New called\n";
  print join ( "\n", @_ ), "\n";

  my $self = {};
    #need to give self something, because it needs to be 
    #a reference to something - in this case, an empty hash
    #you don't need to do this if you do something like:
  
  #my $self;
  #$self -> {_description} = "New Object"; 
  
  #because if you do that, self is no longer an undefined scalar, it's a reference.
  print "And Done\n";
  
  bless ( $self, $class ); 
  #note - the return code of 'bless' is the object reference.
  #perl implicitly returns the result of the last operation
  #so this 'return' below would occur implicitly if bless were the last
  #line in the sub. 
  return $self;
}

sub print_something
{
  my ( $self, @args ) = @_;
  print "Printing something (", $self, ") : ", @args, "\n";
}

sub set_description
{
  my ( $self, $desc ) = @_;
  $self -> {_description} = $desc; 
}

sub get_description
{
  my ( $self ) = @_;
  return $self -> {_description};
}

sub DESTROY
{
   print "Tidying up the object\n";
   print "Args of:".join ( "\n", @_ ),"\n";
}

1;

Code to drive 'MyObject':

#!/usr/bin/perl

use strict;
use warnings;

use MyObject;

{
  my $object_for_me = MyObject -> new();
  $object_for_me -> print_something("Cool");
  $object_for_me -> set_description ( "New Description" );
  print $object_for_me -> get_description, "\n";
  print "Doing it 'subroutine style'\n";
  &MyObject::set_description ( $object_for_me, "Different Description" );
  print &MyObject::get_description ( $object_for_me ),"\n";
}

print "Ending program\n";

Entry tags:

pirate maelstrom

The Pirate Alphabet

Inspired by the crowd at Maelstrom, and mostly driven by the_wood_gnome.
(Repost with minor redrafts)

If you're struck by inspiration for a letter - the dirtier the pun, the better - then please let me know. (And in true piratin' form, these are more like guidelines, than actual rules)

( The Pirate Alphabet )

In which we turn a Raspberry Pi into a bandwidth monitor for (sky) broadband

Following on from my previous post about why RRDtool is awesome.
A worked case study.

First off, we take the Raspberry Pi.
Install 'rrdtool' using:
sudo apt-get update && sudo apt-get install rrdtool
And the perl library:
sudo apt-get install librrds-perl

Sky routers have a router stats page, on http://192.168.0.1/sky_system.html
(You will need your router username and password)
You can check it works with
wget --http-user username --http-password password http://192.168.0.1/sky_system.html

There's a table in there that looks a bit like:
( Read more... )

RRDtool is amazing

I've been playing recently with a piece of software that I keep coming back to, and discovering new coolness.
It's called RRDtool.

RRD stands for Round Robin Database. And what this tool does is allow you to insert time based statistics into an RRD, and extract them later as graphs. It includes automatic statistic aggregating and archiving, which means it's ideal for ... well, all sorts of statistics really.

It's used by a spectacular number of utilities - including Cacti, MRTG, and - the way I first ran into it - Big Brother. Here's the full list.

But the tool itself is really very useful - there's all sorts of things that have performance counters, and ... RRDtool is almost perfectly suited to collating them - allowing you to sample information at almost any frequency you choose - and then collate them from high resolution diagnostics, to longer term trends.

It's easier to get going than you might think - first of all, you create an RRD. You do so by giving it a 'Data Source' (DS) - which defines the input data. And a Round Robin Archive (RRA) - which defines resolution, retention and auto-archiving.

And then you insert into your RRD samples collected, which it turns into consolidated data points - which can be extracted and turned into graphs (very easily, but you've got a very powerful graphing tool that also allows you to add in formula and transformation if you so desire). Or just pulled out as a set of data points.
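To show what 'consolidated data points' means, here's a toy round robin archive in plain Perl. It is emphatically not how RRDtool is implemented, and it doesn't use the RRDs library - it just averages every few samples into one point and keeps a fixed number of points, which is the general idea. The names new_archive and add_sample, and the sizes used, are made up.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# A toy archive: every $steps samples are averaged into one consolidated
# data point, and only the newest $rows points are kept.
sub new_archive {
    my ( $steps, $rows ) = @_;
    return { steps => $steps, rows => $rows, pending => [], points => [] };
}

sub add_sample {
    my ( $archive, $value ) = @_;
    push @{ $archive->{pending} }, $value;
    return if @{ $archive->{pending} } < $archive->{steps};

    # Enough samples collected: consolidate them into one averaged point.
    my $sum = 0;
    $sum += $_ for @{ $archive->{pending} };
    push @{ $archive->{points} }, $sum / $archive->{steps};
    $archive->{pending} = [];

    # Round robin: the oldest point falls off once the archive is full.
    shift @{ $archive->{points} }
        while @{ $archive->{points} } > $archive->{rows};
    return;
}

# Average every 3 samples, keep the last 2 points.
my $archive = new_archive( 3, 2 );
add_sample( $archive, $_ ) for 1 .. 12;
print join( ", ", @{ $archive->{points} } ), "\n";
```

Feeding in 1 to 12 produces averages of 2, 5, 8 and 11, of which only the newest two (8 and 11) survive. That's why the file never grows - old detail is overwritten, and you'd keep a second, coarser archive for the long term trend.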

But the reason I've been particularly impressed with it recently, is because it implements something called 'Holt Winters forecasting'. Now, to save too much brain ache, what this is is a technique for smoothing off a graph. But the important part is that it includes a seasonal variance (and by 'seasonal' it means in a statistical sense - in most cases in IT, a 'season' is a day) which means you've a mechanism to smooth and predict - along with an expected variance, based on your seasonal trend.

This means that - rather than setting a threshold of 'bad' and 'good' on your system state (which rarely works well, because a lot of system statistics are very hard to give such a binary answer) - you can instead detect aberrant behavior - simply count (on a rolling window) the number of times your measurement strays outside the expected variance, and flag an error if that count gets too high.
This is really very cool indeed. Self tuning statistics that can bring themselves to your attention when they're 'interesting'.
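The rolling window count is simple enough to sketch in plain Perl. This isn't RRDtool's code - the predicted value and expected deviation would really come out of the Holt Winters smoothing, whereas here they're hard-coded, and the sub name find_failures plus the window, threshold and scale values are all assumptions of mine for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# $samples   : list of [ observed, predicted, expected deviation ]
# $window    : how many recent samples to look at
# $threshold : how many violations in that window count as a failure
# $scale     : how many deviations wide the 'normal' band is
sub find_failures {
    my ( $samples, $window, $threshold, $scale ) = @_;
    my @recent;      # 1/0 violation flags for the last $window samples
    my @failures;    # indexes of the samples where a failure is flagged

    for my $i ( 0 .. $#{$samples} ) {
        my ( $observed, $predicted, $deviation ) = @{ $samples->[$i] };
        my $violation =
            abs( $observed - $predicted ) > $scale * $deviation ? 1 : 0;

        push @recent, $violation;
        shift @recent if @recent > $window;

        my $count = 0;
        $count += $_ for @recent;
        push @failures, $i if $count >= $threshold;
    }
    return @failures;
}

# Predicted 10 with a deviation of 1 throughout; only 'observed' varies.
my @samples = map { [ $_, 10, 1 ] } ( 10, 13, 10, 14, 15, 10, 10, 10 );
my @failures = find_failures( \@samples, 3, 2, 2 );
print "Failures flagged at samples: @failures\n";
```

The single spike to 13 is ignored, because one stray reading in a window of three isn't enough. The 14 and 15 together tip it over, and the failure clears again once they've rolled out of the window.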

I would also note - this has led me to discover R - a statistical modeling and manipulation tool. What this has helped with, is tuning Holt Winters parameters - it has parameters for adjustment rates of statistics - as with any smoothing algorithm, you set a 'weight' for adjusting the curve. The answer to 'what should I set these parameters to' is 'it depends' normally.
What R will do is (easily) let you feed in your samples, run 'HoltWinters' on it, and spit out the optimal parameters based on your data. (It does a least-squares fit, minimising the squared prediction error, to find which parameters provide a 'best fit').
This too is awesome. (And R can do amazing amounts of other stuff too, which I like).
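For a feel of what 'finding the best fit' means, here's a cut-down sketch in Perl. It is not what R actually runs - it only handles a single smoothing weight (plain exponential smoothing, with no trend or seasonal part) and it just tries 20 values on a grid rather than using a proper optimiser. The sub names and the sample data are made up.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sum of squared one-step-ahead prediction errors for a given weight.
sub sse_for_alpha {
    my ( $alpha, @series ) = @_;
    my $level = $series[0];    # seed the smoothed level with the first sample
    my $sse   = 0;
    for my $value ( @series[ 1 .. $#series ] ) {
        my $error = $value - $level;    # how wrong the prediction was
        $sse   += $error**2;
        $level += $alpha * $error;      # nudge the level towards the sample
    }
    return $sse;
}

# Try alpha = 0.05, 0.10 ... 1.00 and keep whichever fits best.
sub best_alpha {
    my (@series) = @_;
    my ( $best, $best_sse );
    for my $step ( 1 .. 20 ) {
        my $alpha = $step / 20;
        my $sse   = sse_for_alpha( $alpha, @series );
        ( $best, $best_sse ) = ( $alpha, $sse )
            if !defined $best_sse || $sse < $best_sse;
    }
    return $best;
}

my @steady = ( 10, 12, 9, 11, 10, 12, 9, 11, 10, 12 );    # noise around a flat level
my @rising = ( 1 .. 10 );                                  # a steady climb

printf "steady series: best alpha %.2f\n", best_alpha(@steady);
printf "rising series: best alpha %.2f\n", best_alpha(@rising);
```

The steadily rising series wants the highest weight it can get (chase the latest sample), while the noisy-but-flat one is better off with a lower weight that ignores the wobble. Which is the 'it depends' in a nutshell - the right answer comes from the data.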

Anyway, if - like me - you like being able to see traffic graphs, CPU loads, average/peak concurrent user trends, response times and all sorts of stats in graphical and historical form - this is exactly the tool for the job.

Empire: Event Two

The event started on Thursday for me - getting to site, setting up and saying 'hi'. Which is just as well, as during Friday a really impressive amount of wind and rain was responsible for the destruction of an impressive number of tents.
Despite that though, shortly after 18:00 on Friday, the weather turned and we had glorious weather the rest of the event.

And it really did make it a blast. The potential visible at the first event, that was suppressed by the cold... sprang into life. There was no shortage of things going on - the only quiet time really was when 'everyone' was off on the battlefield. (And that was fine, as it gave time to stock up on bacon).

Between the different camps being beautifully dressed, and the various people really trying (and succeeding) at giving their nations a real 'feel', just wandering around was a joy. And there was no shortage of 'Stuff' going on, between the deliberations of the Synod, Senate and Conclave. (And presumably the Bourse/Military council too).

So actually, didn't really end up doing much, aside from wandering around and talking to people. Which wasn't really a bad thing - I really enjoyed it. But I think I do need to up my game a little, and get involved a little more.

Doctor Who - Shightmare in silver

I have to say - I really had high hopes of Neil Gaiman's second episode - in what has been a lacklustre season so far, it ... well, frankly I was hoping to see the same magic as The Doctor's Wife. Sadly not. I'm not sure what's going on with this season, but it really hasn't managed to deliver any of the really cracking episodes of the previous seasons.
( In case of spoilers )

Last Night's Dr Who (In which there be spoilers)

( Because of spoiler )