Over the last four years we have witnessed a significant change in both the approach to building relevant departments and the understanding of what a professional in the field is actually expected to do.
In 2012, when we first found ourselves responding to huge demand for Data specialists, we were very quickly speaking to companies who were sure that they wanted to hire a junior or senior Data Scientist, or build a team of them. They thought that they were ready for a Data Scientist simply because they had a lot of data (which was itself questionable in many cases) and they had read that, with enough probing, the Data in question would yield fantastic commercial insights capable of producing huge profits.
In many cases, we successfully helped them to find the right people. Some companies would even hire a few Scientists at once.
And yet, despite all this perceived success (the clients had their Scientists that were going to lay the Golden Egg, and we had filled our brief) there was an obvious issue. Companies new to Data Science had built a house without the foundations.
It can be argued that Data Engineers are the unsung heroes of the Data Science field. It’s simply not possible to do good Data Science work (which by its very definition should be scalable for wider experimentation/innovation) without a decent environment to operate in.
Many companies had asked the Data Scientists to do their work without a central data store. All the important stuff was unclean, hidden away in disparate warehouses across the business (particularly in some of the larger organisations that we work with) and therefore ultimately useless. Having taken feedback from clients (and indeed, the Scientists that we placed) that this was causing serious issues, our next obvious market at Xcede was provided for us (and isn’t that ultimately the point of useful recruitment?).
Hence, relatively early into the field’s progression here in the UK, we took a strategic decision to recruit for both Data Scientists and Data Engineers on one desk. We could provide a more well-rounded solution to the clients that were building out their Data Science offering from scratch and needed advice on how to do so.
In turn, by studying the end-to-end lifecycle in depth so that we could speak about the processes with some authority, we as a business began to learn exactly what our clients needed (even if they didn’t know the exact name for it yet).
With this knowledge, we’ve begun to give our own names to various types of profiles under the Data Science & Engineering banner and to take a side in the debate over what a Data Scientist should do. Check out Robert Chang’s (Data Scientist, Airbnb) article on Type A and Type B Data Scientists for a good read. The term ‘Data Scientist’ as a whole is often too vague when you’re looking for the right person for the needs of a particular business.
I’m now of the mind that the idea of a ‘Big’ Data Engineer is also far too vague and needs some clarifying. As with Scientist types A and B, we’re not saying that the two skillsets are mutually exclusive; many Engineers will be capable of performing the responsibilities of both types. The difficulty is that some companies, when looking for people to perform a certain set of tasks, will struggle to summarise their need, and that can lead to wasted time for all concerned when they are really looking for a specialist in one area or the other.
We’ve helped clients build out entire Enterprise Data layers, and we’ve also helped them find that engineer who is capable of making sure that the Data Science magic works as well as it should, so at this point we feel that we can give a rough summary of the two distinct ‘Data Engineer’ positions that we encounter.
Data Infrastructure Engineer
In an ideal world, as you’re building your Data Engineering team, the person who is most capable of performing Data Infrastructure responsibilities should be your first, and therefore most pressing, hire.
This person will effectively be responsible for building a data platform to ingest and process data at scale. Naturally, this could be a clean build on a greenfield project, where they can get involved with selecting the architecture (e.g. what tools to use, whether to go Lambda and combine batch and stream processing, or something simpler), or something a little more transformative involving dodgy legacy systems.
They will maintain the data flow too. At the top level of the field, there’s a need to help manage multi-terabyte to petabyte-scale clusters and create easy-to-use systems to handle security, disaster recovery and all sorts of other issues that arise when you’re working with masses of data (especially if there’s a high-load situation in play).
Naturally, how robust the maintenance process is and, more pertinently, how much of it any single Infrastructure Engineer handles will depend on your DevOps (/DataOps) philosophy. In any scenario, there will be a level of responsibility that needs to be adhered to!
For example, we’ve seen that some clients choose to hire additional employees to work on the administration side of things and support the Engineers, freeing them to take on new projects (e.g. if there’s a build in a Hortonworks environment, hiring a ‘Hadoop Admin’ can ease pressure).
Basically, when a team decides to designate a Data Infrastructure Engineer, in most cases they’re going to be the guys and girls who are responsible for the design and build of the environment used to store and process data.
Data Science (DS) Engineer
This is the one that often seems to cause issues. We’ve seen many companies use this title when they don’t know whether they need a Scientist or an Engineer, which in itself is worrying. For us, whilst Data Science Engineers do use the technical skillsets, and have elements of the knowledge, that both Scientists and Engineers need, their actual responsibility is far more defined.
We’ve come to class a Data Science Engineer as someone who acts as the essential middle link between an evolved Data Engineering and Data Science team.
As such, they will be responsible for the build and maintenance of a model (or indeed data product) that a company is productionising within its business. Familiarity with practical applications of machine learning methods on (generally) large-scale datasets is vital, and experience of algorithmic implementation (design, refinement, testing, optimisation and deployment) is key.
In relation to the other members of the team, then, it is vital that the DS Engineer understands the theoretical model that has been designed by the Scientists, whilst also understanding the technical environment in which they deploy and scale the model that will yield the desired results.
As such, we believe a DS Engineer should be hired in tandem with your Data Scientist, in order to create a smoother (and quicker) timeline from the initial design of the model to production.
We’ve pointed out the differences between the two in order to be more specific when hiring “Data Engineers”, but I’m keen to emphasise that not all teams can afford the luxury of splitting these responsibilities so cleanly. In start-ups and early-stage teams, many engineers will have to try to handle both sets of responsibilities, which is pretty tricky (and a great learning curve)! However, as Data Science departments expand and become further ingrained in company culture, specialists in both become very important!
If you’re a Data Engineer reading this, how far would you agree with the above? Scientists, do you feel you handle the workload of a Data Science Engineer anyway? Let us know!