13th February 2019
Our latest edition of Intelligence looks at the changing role of data in the business world of today. The role of Data Scientist is increasingly critical in modern organisations, but the job’s challenges and responsibilities are not often fully understood. Data Scientist Simon Coates of Capita’s Innovation Team fills in some of the gaps.
So what does a Data Scientist actually do? A quick online search will give you numerous answers, each slightly different, and invariably using the industry’s latest buzzwords (Machine Learning, AI, Big Data etc). My own definition is that, at its core, a Data Scientist is someone who uses advanced statistical and data engineering techniques to make the world a better place. Sound like a lofty goal? Maybe, but then the role was once described by the Harvard Business Review as the ‘Sexiest Job of the 21st Century’. A few noble embellishments are allowed…
On a typical day I’ll have one project I’m working on and at least two more in the pipeline. The key to any successful data science project is the data itself, and one of the biggest challenges I’ll face on any of those projects will be the volume and variety of data sources. We now have a multitude of data coming from social media channels, web chats and more. While the technology available to store, clean and analyse data is advancing all the time, so too are the technologies allowing customers to interact with companies. Consequently, the amount of data available keeps on rising.
The technological improvements don’t stop there. In the last few years we have also seen rapid expansion of open data sources. For example, https://data.gov.uk/ now has over 40,000 datasets that can be used to improve the services offered or to enhance your own dataset. Using open data sources allows me to dive even deeper into a problem and consider factors that I previously could only have included by accessing external datasets, potentially very expensive ones. The open datasets available cover a wide variety of sectors and data points. Any quiet time I might have will frequently be spent looking through these, so that I know what is available should an urgent requirement come in where they can help with the analysis.
The other big change I’ve seen is the ability to work easily with unstructured data. For Customer Management, the main (but not the only) source of unstructured data is the conversations customers have with our agents, across all channels. We can now take what was actually said or written between the two and analyse it in a way that was not possible even a few years ago. There are various ways to turn unstructured data into structured data. The most commonly used approach for conversational data is to study the sentiment of the conversation and to pick out key words or phrases, which can be highly effective for the clients we work for.
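As a rough illustration of that idea, here is a minimal Python sketch that turns a piece of conversation text into a structured record with a sentiment score and a list of key words. It is not the method used on any particular project: the word lists, the function names and the simple counting approach are all assumptions made for this example, and a real project would more likely use a maintained sentiment lexicon or a trained model.

```python
import re
from collections import Counter

# Hypothetical word lists, for illustration only.
POSITIVE = {"thanks", "great", "helpful", "resolved", "happy"}
NEGATIVE = {"angry", "late", "broken", "cancel", "unhappy"}
STOPWORDS = {"the", "a", "an", "is", "was", "my", "i", "to", "and", "it", "for", "you"}


def tokenize(text):
    """Lower-case the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def sentiment_score(text):
    """Count positive words minus negative words (a very crude score)."""
    tokens = tokenize(text)
    positive = sum(token in POSITIVE for token in tokens)
    negative = sum(token in NEGATIVE for token in tokens)
    return positive - negative


def key_words(text, top_n=3):
    """Return the most frequent words that are not stopwords."""
    counts = Counter(t for t in tokenize(text) if t not in STOPWORDS)
    return [word for word, _ in counts.most_common(top_n)]


def structure_conversation(conversation_id, text):
    """Turn one free-text conversation into a structured record."""
    return {
        "id": conversation_id,
        "sentiment": sentiment_score(text),
        "key_words": key_words(text),
    }
```

Calling `structure_conversation("c1", "My order is late and broken")` would give a record with a negative sentiment score, which can then be stored and analysed alongside other structured data about the same customer contact.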
At the start of every data science project my first objective is to make sure I’m using the best possible data for the task at hand. This can include working with clients to enhance our datasets with data they own, searching open data sources, looking at other data sources owned within Capita, and finding new and innovative ways to turn the unstructured into the structured – and doing all that while making sure I stay within the rules of GDPR.
Next is ensuring I’m using the right program for the task. There are many different options available when combining, cleaning and analysing data. Each has its own advantages and disadvantages in any situation. A good Data Scientist needs to stay on top of a wide variety of programs, and be capable of using them at an advanced level, in order to have the best possible chance of negotiating all the challenges a project will throw up. My program of choice is RStudio, but I make sure I keep my Python and SQL skills up to date too. The main advantage of R and Python is that both are open source and have a large community of users constantly working to make them better. I think the right question to ask yourself when choosing a program is always ‘what’s the best one for this task?’ rather than ‘what’s the one I’m best at using?’
Once I have my data sources finalised, have linked all the data and cleaned and prepared it for analysis, the fun can start – finding answers to specific business problems along with workable solutions that will improve performance.
What does the future hold? Well, with the amount of investment taking place in Data Science toolkits on both Microsoft Azure and Amazon Web Services, plus the ever-increasing library of data sources and the need and desire for performance improvements, I think that I and all the other Data Scientists in Capita will be kept busy for the foreseeable future.
For more insight check out our latest edition of Intelligence, 'The Power of Data'. In it, our experts look at how data is changing the shape of modern business, the new rules of the post-GDPR world, and more.