Speaker Interviews
During the past few days we conducted interviews with our speakers. We are going to share what they told us with you before the conference kicks off here.
We talked to Alex Lloyd, Otis Gospodnetić, Markus Weimer, Anne Veling, Tim Lossen, Alex Baranau, Grant Ingersoll, Leslie Hawthorn, Andrzej Bialecki, Friso van Vollenhoven, Bertrand Delacretaz, Ioan Eugen Stan, Frank Scholten, Robert Muir, Rafał Kuć and Eric Evans.
Alex Lloyd
Could you briefly introduce yourself?
Alex Lloyd is a Senior Staff Software Engineer in the storage infrastructure group at Google. In this role, he led the replication implementations for Megastore and Spanner, global storage systems that underpin a wide array of Google services. He is currently working on distributed multiversion concurrency control algorithms. Prior to his current role, Alex worked with the Blogger team. Before joining Google in 2004, Alex built mobile sync middleware at BEA Systems. As an undergraduate, his dorm-room startup led him to his first job at Object Design, where he worked on visual templating tools. Alex has a bachelor’s degree in computer science from Harvard University.
How did you get started developing software?
Driving that little LogoWriter turtle around when I was 7 or 8.
What will your talk be about, exactly?
Spanner is a planet-scale distributed database that Google developed over the last five years. It offers a SQL-based query language, relational data model, and efficient serializable transactions over all data. I will talk about why we wanted SQL at NoSQL scale. I'll also talk about the underlying technologies that let us offer strong semantics at this scale.
Have you enjoyed previous Berlin Buzzwords editions?
Looking forward to my first time.
Otis Gospodnetić
Could you briefly introduce yourself?
My name is Otis Gospodnetić. I'm originally from Croatia, but now live and work (mostly) in New York City from where I run Sematext International, a geographically distributed products and services company focused on Search and Big Data Analytics. I'm the co-author of Lucene in Action and a member of Lucene PMC, Apache Software Foundation, and NYC CTO Club.
How did you get started developing software?
It was either Turbo or Borland Pascal in high school. Then, while at the university, I found myself always building my own applications or services, mainly in Perl back then, and building little ventures on the side (instead of focusing on studies). After that I've done some GUI development, but have mainly focused on backend. Over time that focus narrowed further and I found myself working on search all the time. That's where my passion was/is.
What will your talk be about, exactly?
At Sematext we've put a lot of effort into building really scalable Search Analytics and Performance Monitoring services, both of which run in the cloud and utilize Big Data technologies such as Hadoop, HBase, and Flume. In my talk I'll share our experience building and running our multi-tenant Performance Monitoring SaaS from the cloud. We'll share info about all components end to end, from the point where performance data is collected to the UI where it is graphed. Since this service provides monitoring for Solr, ElasticSearch, HBase, and Sensei, people using any of these technologies may be interested in seeing where their performance data may go and how it is processed.
Have you enjoyed previous Berlin Buzzwords editions?
Oh have I. Yes, indeed I have. I enjoyed the talks as well as chatting with people I've met at other conferences and whom I know from the almost two decades I've been involved in open source. So I'm happy to come back to Berlin and hope to have even more fun.
Is there anything else you want to tell us?
There will be three of us from Sematext giving talks at Buzzwords this year (that's almost 50% of the company! ;)). Two of these talks will include information about technology, issues, and solutions we've come up with (some of which we've open-sourced) while building Search Analytics and Performance Monitoring services. These services are free at https://apps.sematext.com/ . Thus, this may be a good time for people to try out these services and use the opportunity to ask us any questions they may have about them in Berlin, where we'll also be exhibiting.
Markus Weimer
Could you briefly introduce yourself?
I'm a Principal Scientist at Microsoft's Cloud Information Services Laboratory. By training, I am a machine learner and did my PhD on recommender systems and their use in the educational domain at TU Darmstadt, co-supervised by Max Mühlhäuser and Alex Smola. Today, my work focuses on solving the two critical interface problems large scale machine learning faces: the one between the world and machine learning (feature extraction, example formation, APIs) and the one between machine learning and distributed systems. My talk at Berlin Buzzwords will focus on the latter aspect. You will also most likely see me with a camera in hand most of the time, as I am an avid photographer.
How did you get started developing software?
At the tender age of eight, abusing my brother's C64 for the most amazing program ever: 10 PRINT "HELLO WORLD!"; 20 GOTO 10. It has gone downhill in utility since then: a mechatronics system for a hybrid bike (a hybrid between you and an electric motor), mining optimization software, recommender systems, spam filtering systems, large scale distributed machine learning systems. But that first one was the most fun ;)
What will your talk be about, exactly?
Large scale, distributed machine learning today suffers from one of two poisons: either it is written in Hadoop MapReduce, which means it is inefficient due to the poor fit of Hadoop for iterative algorithms, or it is fast, but built on a non-standard, hard-to-code-for foundation. I will present one possible way out of this dilemma, based on a runtime not unlike Hadoop, but with iteration support, and a language not unlike Pig, but with loops.
When did you start contributing to Apache projects?
*blush* This is the question where you'd expect a long list of great contributions to Apache. However, I am currently not a committer on any project. And given that my greatest contribution to Hadoop so far has been to uncover a critical bug by bringing down a several-thousand-node cluster at Yahoo!, that is probably intentional ;) Seriously: we are committed to Apache open source for the project I am presenting. So: stay tuned and invite me again. I might have a proper answer for this question next time around.
Have you enjoyed previous Berlin Buzzwords editions?
Yes. I found the time to talk to all kinds of folks from the Bay Area in Berlin, ironically. I am looking forward to this edition. Always good to come home (sort of, I never lived in Berlin).
Machine learning is still some sort of magic and black art today. What do you think is needed to make your average Joe developer understand how to apply machine learning to his problems, which algorithms to choose, and how to pre-process data?
First of all: there are plenty of awesome online lectures available on machine learning. Alex Smola's series at UC Berkeley specifically addresses large scale machine learning, and Andrew Ng's Stanford class is probably the best intro available. So step one from my perspective would be: spend some quality time watching those awesome teachers. Now to your question: for several classes of machine learning algorithms, cookie cutter solutions are now within reach and we should be able to provide those. Also, lots of problems with the data can be uncovered automatically. It's just that no one has done the work to write all the tools necessary for proper and semi-automated pre-processing. I am confident that this will happen.
Several machine learning algorithms aren't a particularly good fit for Apache Hadoop. Which ones stand out as being well suited for Hadoop, and which ones are particularly hard to implement in a scalable and efficient way?
Just about any machine learning algorithm is a terrible fit for Hadoop-*MapReduce*. Notice the highlight: there are plenty of aspects of Hadoop that are perfect for machine learning: Data-Local Scheduling, Distributed Filesystem, Fault Tolerance, ... MapReduce is indeed a great fit for lots of machine learning algorithms, but not as implemented in Hadoop: the lack of optimizations for iterative workloads means that Hadoop is about 30x slower than it should be (the 30x is not random; it is the speedup that all academic papers claim over Hadoop). The only algorithm I can think of that is a great fit for Hadoop-MapReduce is Naive Bayes: it essentially performs WordCount and only needs a single pass over the data.
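To make the Naive Bayes point concrete, here is a minimal sketch (ours, not Markus's) of the single counting pass it boils down to, written against the standard org.apache.hadoop.mapreduce API; the input format and class names are made up for the example:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    class NaiveBayesCounting {
      // Map: emit ("label:token", 1) for every token of every labeled
      // document, assuming one "label<TAB>document text" record per line.
      static class CountMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
            throws IOException, InterruptedException {
          String[] parts = value.toString().split("\t", 2);
          if (parts.length < 2) return;
          for (String token : parts[1].toLowerCase().split("\\s+")) {
            ctx.write(new Text(parts[0] + ":" + token), ONE);
          }
        }
      }

      // Reduce: sum the counts; these sums are the sufficient statistics
      // from which the Naive Bayes probabilities are estimated.
      static class CountReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context ctx)
            throws IOException, InterruptedException {
          long sum = 0;
          for (LongWritable v : values) sum += v.get();
          ctx.write(key, new LongWritable(sum));
        }
      }
    }

One pass over the data, no iteration: exactly the shape MapReduce was built for, which is why this algorithm escapes the 30x penalty he describes.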
If you were the god of Apache Hadoop - what would you change?
My wish has already been granted: YARN. With the new resource manager, we will be able to share the same cluster between MapReduce jobs and those that use alternative frameworks optimized for machine learning tasks.
Is there anything else you want to tell us?
A word of advice: Contrary to popular beliefs held in California, Berlin is not always frozen. It actually gets pretty hot in summer. So bring T-shirts as well as your ski suits :)
Anne Veling
Could you briefly introduce yourself?
I am Anne Veling, a freelance software architect helping large corporations design and build internet applications that often include search. I have been involved in the Lucene community for a long time, giving Lucene workshops. Married, proud father of 3, into improvisational theatre to vent a mind always full of ideas.
How did you get started developing software?
I used to look over the shoulder of my brother, doing pair programming back in 1985. Interested in mathematics and artificial intelligence, inspired by Gödel, Escher, Bach by Douglas Hofstadter. I worked at several Internet startups in the search engine arena, and worked as a performance/scalability troubleshooter for 6 years before starting my own company. I'm always programming many ideas at once, both for work and on my own time (finishing that last 20% of the apps is my main bottleneck... ;-)) Interested in programming languages, collecting them and always trying out new tools, including obfuscated programming languages and creating my own compilers.
What will your talk be about, exactly?
I'll talk about how we implemented the autocomplete functionality for the Dutch public transport site 9292.nl, which tries to recognize which address or station a user means as he types into a single edit box. We could have done that traditionally with a single concatenated field in Solr, but we did something different here, using the many different fields in our database to our advantage, and making use of the incredible speed of Lucene to recognize the different meanings of the words in context. This has allowed us to tweak the algorithm much better than we otherwise could, in a highly ambiguous data space.
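The sketch below is not the 9292.nl code, just a minimal illustration of the idea using the Lucene 3.x-era API: the typed prefix is tried against several structured fields instead of one concatenated field, so the matching field itself disambiguates what the token means (field names are invented):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.PrefixQuery;
    import org.apache.lucene.search.Query;

    class AutocompleteQueries {
      // Try the typed prefix against each structured field; which field
      // matches tells us whether the token was a city, street or station.
      static Query build(String prefix) {
        String[] fields = {"city", "street", "station"};  // illustrative names
        BooleanQuery query = new BooleanQuery();
        for (String field : fields) {
          query.add(new PrefixQuery(new Term(field, prefix)),
              BooleanClause.Occur.SHOULD);
        }
        return query;
      }
    }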
Have you enjoyed previous Berlin Buzzwords editions?
No, I have not, though I've heard great stories from friends of mine who attended last year.
Is there anything else you want to tell us?
Looking forward to meeting clever people at Berlin Buzzwords and discussing the "future of the internet" over coffee, dinner and beer :-)
Tim Lossen
Could you briefly introduce yourself?
hi, my name is tim, i work as a backend engineer at wooga. i am fascinated by technology and always like to check out new stuff. my favorite programming language is ruby, because it is very flexible and never gets in my way.
How did you get started developing software?
i raided my savings book when i was twelve and bought a commodore 64, starting out with basic.
What will your talk be about, exactly?
i will explain why kafka is ideally suited as a central "event bus" which enables near-realtime processing of the event stream, and how we use it at wooga as the core of our new tracking infrastructure.
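To picture the event-bus idea: every service appends its events to a Kafka topic, and any number of downstream processors consume the stream independently, at their own pace. A minimal producer sketch, using today's Java client rather than the one available at the time of the talk (topic, key and event names are invented):

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    class TrackingProducer {
      public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer",
            "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
            "org.apache.kafka.common.serialization.StringSerializer");
        // Every service appends its events to the same topic; consumers
        // (analytics, monitoring, archival) each read it independently.
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
          producer.send(new ProducerRecord<>("tracking", "user-42", "tutorial_completed"));
        }
      }
    }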
Have you enjoyed previous Berlin Buzzwords editions?
yes, it is a good place to learn about hot emerging technologies. actually, i discovered kafka at buzzwords last year!
What was the first Apache project you got in touch with?
tomcat, i guess.
Why did you choose Kafka for event stream processing at Wooga?
because it is based on a very simple design and concentrates on doing only one thing really really well.
If you were the god of Kafka - what would you change?
i would change the name, because it is super hard to google. "kafka download" will list dozens of audio books, for example ...
Is there anything else you want to tell us?
looking forward to meeting you all in june!
Alex Baranau
Could you briefly introduce yourself?
I'm a software engineer passionate about big data, search, analytics, mathematics, solving complex problems and sharing findings via articles & presentations.
How did you get started developing software?
Initially I started with Big Enterprise projects, but later switched the focus to Enterprise Search and Big Data fields.
What will your talk be about, exactly?
In this talk we’ll explain how we implemented “update-less updates” (not a typo!) for HBase using an append-only approach. This approach uses HBase core strengths, like fast range scans and the recently added coprocessors, to enable real-time analytics. It shines in situations where high data volume and velocity make random updates (aka Get+Put) prohibitively expensive. Apart from making real-time analytics possible, we’ll show how the append-only approach to updates makes it possible to perform rollbacks of data changes and avoid data inconsistency problems caused by tasks in MapReduce jobs that fail after only partially updating data in HBase.
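A minimal sketch of the append-only idea - not Sematext's actual code, which the talk covers - using the classic pre-1.0 HBase client API; the row-key scheme and column names are invented for the example:

    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    class AppendOnlyCounter {
      static final byte[] FAM = Bytes.toBytes("d");
      static final byte[] COL = Bytes.toBytes("delta");

      // Write path: record a delta as a brand-new cell instead of Get+Put.
      // A unique row-key suffix keeps concurrent updates from colliding.
      static void recordDelta(HTable table, String metricId, long delta) throws Exception {
        Put put = new Put(Bytes.toBytes(metricId + ":" + System.nanoTime()));
        put.add(FAM, COL, Bytes.toBytes(delta));
        table.put(put);
      }

      // Read path: a fast range scan over the metric's rows folds the deltas.
      // (In the talk's design, coprocessors also fold them during compactions,
      // so scans stay short.)
      static long readValue(HTable table, String metricId) throws Exception {
        Scan scan = new Scan(Bytes.toBytes(metricId + ":"), Bytes.toBytes(metricId + ";"));
        long sum = 0;
        ResultScanner scanner = table.getScanner(scan);
        try {
          for (Result r : scanner) {
            sum += Bytes.toLong(r.getValue(FAM, COL));
          }
        } finally {
          scanner.close();
        }
        return sum;
      }
    }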
Have you enjoyed previous Berlin Buzzwords editions?
Absolutely, I've been to all of them. I was an attendee in 2010 and did a lightning talk the following year.
Is there anything else you want to tell us?
Everything is organized very well! Thank you!
Grant Ingersoll
Could you briefly introduce yourself?
My name is Grant Ingersoll. I'm the Chief Scientist at Lucid Imagination, as well as one of the company's co-founders. I've been a long-time contributor and committer to Apache Lucene and Solr as well as a variety of other open source projects. I've been a developer for 15+ years now, most of it focused on search and natural language processing, but I did spend my early career working on parallel and distributed simulations.
How did you get started developing software?
My first paid programming job was working on distributing computational electromagnetics simulations in FORTRAN for a small company based in Syracuse, NY. Luckily, I was able to work on a variety of different things for that company, ranging from video conferencing to building out distributed systems on Linux. After that, I was fortunate to find a job working on a cross-language information retrieval system that allowed users to search French, Spanish and Japanese content using English. It was a really cool introduction to the world of search and natural language processing, and one that has stuck with me to date.
What will your talk be about, exactly?
My talk will be focused on the technical aspects of building out large scale search and discovery solutions using tools like Solr, Mahout and Hadoop. Each of these technologies brings a distinct set of capabilities to the table, and combining them can produce some interesting capabilities for end users, businesses and developers alike.
When did you start contributing to Apache projects?
I'd say my contributions started in earnest around 2004 or 2005 on the Lucene project, but it seems so long ago, I'd have to go back and look in the commit logs to be certain!
Have you enjoyed previous Berlin Buzzwords editions?
I was at the inaugural Buzzwords and really enjoyed both the city of Berlin as well as the conference.
Can you share some particularly interesting analysis projects you know of that have used Apache Mahout?
That's a tough one, as there are a lot of interesting applications out there. Many people know of Mahout for recommendations, and that is clearly its most popular area, but I have seen some really interesting projects around clustering and classification. On the classification front, I know of one ad targeting company that is using it at very large scales, where it significantly outperforms others. I also know of a company that is using clustering to help users better find related content (mostly stock images). At my company, Lucid Imagination, we use it for clustering and for extracting statistically interesting phrases, and are looking to it for other things in the near future.
What is your major goal as a committer to Apache Mahout? Which areas do you think need most attention?
I'd say the major goal is to get to a stable, 1.0 release that has the polish necessary to make Mahout more consumable by a wider audience. To some extent, this means cleaning up APIs and solidifying our tool sets, but it also means cutting some things that haven't proven as useful.
Many of the technologies behind today's buzzword-heavy talks have come from the Apache Software Foundation. What do you think makes Apache projects so successful, in particular for communities developing complex software? What are the risks of going Apache?
I think the answer lies in your question: community. A good community can significantly move a project forward in ways that an individual or company simply cannot. Sure, sometimes there are public disagreements about direction or releases aren't always as predictable, but in the long run, you usually end up with better code due to the openness of the project and the sheer number of people looking at the code.
Is there anything else you want to tell us?
Thanks for putting on a great conference and giving me a chance to be a part of it. I look forward to being in Berlin in a few weeks.
Leslie Hawthorn
Could you briefly introduce yourself?
Hello, world! I'm Leslie Hawthorn and I currently work on Community Action and Impact at Red Hat, Inc. I've been working in the free and open source software world for just over six years, largely focusing on community management, people processes and bringing new contributors into FOSS projects. In past lives, I've worked as Outreach Manager for the OSU Open Source Lab, home to Drupal, the Apache Software Foundation and many other well-known FOSS projects. I also managed the Google Summer of Code program for more than four years and created the contest now known as Google Code-in. When not focused on all things open source and community related, I enjoy organic gardening, learning more about permaculture and cooking fabulous meals with foods harvested from my back yard. I live in Portland, Oregon, US with my two cats and limit myself to no more than two Powell's Books visits per month for the sake of my budget.
How did you get started developing software?
I actually don't develop software. Well, those 800-odd lines of Python I have under my belt probably don't count. I focus on the human angle of software development, specifically how to help teams work together to get things done, how to make sure that open source projects are as accessible as possible to newcomers and how to make open source thrive in a wide variety of areas, from academia to humanitarian efforts. I have a particular interest in mentoring programs, specifically mentoring for youth who wish to get involved in FOSS.
What will your talk be about, exactly?
I'll be talking about Community - what it is, what it isn't, how to make your open source software project most welcoming to newcomers and how to build communities that last. I'll be drawing on the observations I've made after working with hundreds of FOSS projects, sharing community building best practices with the audience. I'm also planning to talk about a few of my favorite FOSS projects that have built tremendously successful communities, as case studies for "community management done right."
Have you enjoyed previous Berlin Buzzwords editions?
This will be my first Berlin Buzzwords. I am greatly looking forward to it!
Is there anything else you want to tell us?
Thus far, all of my travels to Germany have involved catching connecting flights at Hamburg airport. So, I've been to Germany four times now, but have yet to see any of the country. Can't wait!
Andrzej Bialecki
Could you briefly introduce yourself?
I'm married, with two kids, and I live in Poland (near Warsaw). I'm a freelancer, though most of my time nowadays is spent working with Lucid Imagination on its Solr-based platform LucidWorks. I'm also a hobbyist guitar and keyboard player, singer and arranger.
How did you get started developing software?
I started programming while at university (I studied Electrical Engineering), and have been doing it ever since. I got involved with Lucene and Solr because of a failed project to implement a desktop search application - the Lucene-based part was doing a fantastic job, but the management of that company wasn't ;) For the last 6 years I've been involved exclusively in various Lucene, Nutch, Hadoop and Solr projects.
What will your talk be about, exactly?
The problem of small updates to large multi-field documents. Today there is no easy way to do this in Lucene without re-indexing the whole content of a document. I'm going to present a design that provides this functionality, and reuses much of the existing code and concepts already implemented in Lucene.
What do you hope to accomplish by giving this talk? What do you expect?
I'll present an interesting solution to a particular weakness of Lucene, so I hope to raise some interest among developers, and I expect that some of them will want to help with the implementation. Until now this particular problem was thought to be so hard to implement that nobody wanted to tackle it. I'm going to show that it's not that hard (not trivial either, but certainly achievable!).
In 2010, the Apache Nutch project started concentrating mainly on web crawling. What is the current status of your crawler? What are the main areas of focus right now?
Nutch is being actively developed; a new release is planned in the next couple of weeks. There is an experimental line of development that uses another Apache project, Gora, to manage the crawl data using any of the NoSQL platforms supported by Gora - HBase, Cassandra, Accumulo, and even Solr as a key-value store.
If you were the god of Apache Lucene - what would you change?
No need, the angels - committers and contributors - are already doing an excellent job! And more seriously, the Lucene community is one of the most mature, balanced and friendly communities I have ever worked with. Even if egos clash from time to time, the process for resolving such situations is in place and works quite well. During the last 1-2 years Lucene has seen tremendous growth in terms of improvements, innovation and increased functionality. It's an exciting project to be involved in.
Friso van Vollenhoven
Could you briefly introduce yourself?
I am Friso. I build software for a living and in my spare time I sometimes talk about it. Other than that I can mostly be found on sailing boats.
How did you get started developing software?
I studied software engineering. I like building things. It was the obvious thing to do.
What will your talk be about, exactly?
About the tools and frameworks we use to make the combined effort of engineers and domain specialists doing network analysis actually work. These tools include Hadoop, Cascading, Neo4j and the JavaScript InfoVis Toolkit.
Have you enjoyed previous Berlin Buzzwords editions?
Very much!
Comparing Apache Hadoop's HDFS and HBase - what are the benefits of HBase? When should developers decide to use HBase?
Ehm... It's complicated. As a guideline: lots (billions) of small objects / records / files, go HBase; a smaller number of big (huge) files, go HDFS. Beyond that there is a whole story about random write access, different layers and types of caching, operational burden and more...
What is the most challenging problem or analysis task you ever had to solve at scale?
Processing the flow of measurement data that is produced by RIPE NCC's Routing Information Service (https://www.ripe.net/data-tools/stats/ris/routing-information-service).
If you were the god of Apache Hadoop - what would you change?
Hadoop has given engineers the ability to convince management folk that a solution is robust and scalable by adding a logo of a little yellow elephant to the documentation. As far as I am concerned, that is already perfect.
Is there anything else you want to tell us?
Have fun!
Bertrand Delacretaz
Could you briefly introduce yourself?
I'm a senior developer in the CQ5 R&D team at Adobe, which creates content management and digital marketing products based in large part on Apache projects like Jackrabbit, Sling and Felix. I'm an Apache Software Foundation member, active in several projects including the Incubator, and I've served a few terms on the ASF's board of directors. I'm a server-side Java guy and an OSGi and JCR advocate, focused on creating testable, transparent and long-lived software systems.
How did you get started developing software?
I started with building electronics (fun stuff like audio power amps and drum machines) as a hobby when I was 16, gradually moved to microprocessors and embedded systems, created a MIDI interface for accordions in 1983, and eventually moved to full-time software work. My first computer was either a Sinclair ZX81 or the Micro-Professor, a FORTH system that was quite fun to hack; I forget which one came first. I got my first Internet connection in 1993, via a dialup line to a Sun workstation. I'll never forget the first time I pressed ENTER on a hypertext link, in a text-only browser.
What will your talk be about, exactly?
I'm going to show how a JCR-based content repository allows you to create clean, self-explaining and evolvable content structures, by describing a number of content model design patterns and examples.
Have you enjoyed previous Berlin Buzzwords editions?
This will be my first Berlin Buzzwords (my first time in Berlin, in fact) and I'm very much looking forward to it!
NoSQL is a highly under-defined term, covering anything from "easy to use" to "web scale" amounts of data. Where does JCR fit into this picture?
While current JCR implementations might not be as scalable as other NoSQL databases, JCR stands out for the rich application services it provides: content modeling, observation callbacks, versioning, scriptable RESTful interfaces via Apache Sling and more. A JCR repository is not only a content store; it provides a large part of the infrastructure services that you need to build content-based applications, so you'll usually need to write much less code than with other environments. The inherent decoupling that JCR promotes at the content level makes a huge difference in the efficiency of our development teams, especially when combined with the high modularity that OSGi provides via the Apache Sling application layer.
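For readers who have never touched JCR, here is a minimal sketch (ours, not Adobe's) of what content modeling against the standard javax.jcr API looks like; the credentials, node names and node types are invented for the example:

    import java.util.Calendar;
    import javax.jcr.Node;
    import javax.jcr.Repository;
    import javax.jcr.Session;
    import javax.jcr.SimpleCredentials;

    class JcrExample {
      // Content lives in a tree of typed nodes with properties, not in
      // tables; the structure itself documents the application's domain.
      static void storePost(Repository repository) throws Exception {
        Session session = repository.login(
            new SimpleCredentials("admin", "admin".toCharArray()));
        try {
          Node root = session.getRootNode();
          Node posts = root.hasNode("posts")
              ? root.getNode("posts") : root.addNode("posts", "nt:unstructured");
          Node post = posts.addNode("hello-world", "nt:unstructured");
          post.setProperty("title", "Hello, world");
          post.setProperty("published", Calendar.getInstance());
          session.save();  // changes become visible (and observable) atomically
        } finally {
          session.logout();
        }
      }
    }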
Many of the technologies behind today's Buzzwords talks have come from the Apache Software Foundation. What do you think makes Apache projects so successful, in particular for communities developing complex software?
The ASF is a neutral place where people can collaborate in an open way, based on more than 15 years of collective experience in doing that. Our governance model and best practices help project members reach consensus, and our well-known license and release model, combined with the fact that no one besides the foundation owns an Apache project, helps make our projects sustainable in the long term. The ASF is very careful about avoiding any corporate influence, which helps our projects keep their focus over time.
What do you think are the risks when turning your pet (free software) project into an Apache project?
I think the Apache Incubator needs improvements in terms of better explaining what it takes to become an Apache project, what the expectations are regarding project members, and how much freedom projects have, as opposed to a common perception that Apache imposes a lot of things on its projects. As a result, the risk for incoming projects is misunderstanding the context in which they evolve, and losing time and energy fighting the wrong fights. The incubator team (which consists of volunteers, as with all areas of the foundation) is well aware of these issues and working on improving the situation. The best way to avoid these risks is to make sure ASF members with enough available time and energy are available to act as the new project's incubation mentors.
Is there anything else you want to tell us?
First, I think these interviews are a great idea as a way to allow the audience to get to know the speakers better, and secondly I'd just like to add a small disclaimer that the above reflects purely my personal views and opinions.
Ioan Eugen Stan
Could you briefly introduce yourself?
I'm a 26-year-old Romanian free-software and open technology enthusiast who aims to make software more accessible to humans. My main interests revolve around distributed applications, information retrieval and big data.
How did you get started developing software?
My first contact with a computer was in 6th or 7th grade, at a local "IT" club; I continued in high school, took a break during college, and then decided to start again.
What will your talk be about, exactly?
I wish to present the general architecture of Apache James, introduce each component, and talk about how people are using each component in their apps. I'm going to conclude with the HBase mailbox implementation and what it brings to the table. My emphasis will be on how you can scale James to handle a lot of traffic and store a lot of data. I'm planning to write a demo app using the Play framework and James that shows how you could store website comments in a James mailbox. I hope I get things ready in time; if not, I will leave it out.
Have you enjoyed previous Berlin Buzzwords editions?
No, this will be my first edition.
Is there anything else you want to tell us?
Continue doing a great job.
Frank Scholten
Could you briefly introduce yourself?
I am Frank Scholten, a user of and contributor to Apache Mahout and a committer on Apache Whirr. I am a software developer at Orange11 (formerly known as JTeam) in Amsterdam, and I live in Utrecht.
How did you get started developing software?
I have been a Java developer for around 5 years, working on web and integration projects using all sorts of open source frameworks.
What will your talk be about, exactly?
In my talk I will show you how to deploy the Apache Mahout machine learning library via Apache Whirr. Whirr is a library and tool for quickly setting up cloud services such as Hadoop, Zookeeper and all sorts of NoSQL databases. I will introduce both Whirr and Mahout and explain how the Whirr Mahout service works.
When did you start contributing to Apache projects?
A few years ago I got interested in the recommendation framework Taste. Later on, Taste was merged into Mahout, so I became interested in that as well, particularly Mahout's text clustering capabilities. Later still, I got involved with Whirr.
What is your major goal as a contributor to Apache Mahout? Which areas do you think need most attention?
My goal is to make Mahout easier to use from a developer and user perspective. This is also why I became interested in Whirr, as it enables you to set up a Hadoop cluster in the cloud very easily. Creating the Mahout service for Whirr was the natural next step.
Apache Mahout claims to provide scalable machine learning algorithms. How exactly do you define scalability? Where is the project being used already?
Mahout's scalability is related to the size of the dataset it can handle. Many components, such as recommendation and clustering, have both sequential and MapReduce implementations, so you can apply them to smaller and larger datasets. There are several projects where Mahout is used in production - the recommendation framework especially is quite popular - although some organizations keep their use of Mahout confidential. I have used Mahout's text clustering features and more generic statistics components in projects.
Is there anything else you want to tell us?
I am looking forward to a fun and interesting conference in Berlin and would love to hear how other people use these technologies. Cheers and thanks to everyone who is involved in creating the 3rd edition of Berlin Buzzwords!
Robert Muir
Could you briefly introduce yourself?
Robert Muir is both a Lucene/Solr PMC Member as well as a Lucene/Solr Committer. He earned his BS in Computer Science from Radford University and a MS in Computer Science from Johns Hopkins University. Prior to Lucid Imagination, Robert worked for Ntrepid Corporation.
How did you get started developing software?
I've been doing work related to search nearly my entire career. It wasn't even by choice (though it's fun!), just one of those problems you see everywhere.
What will your talk be about, exactly?
I will be discussing all the different ways we are using finite-state automata to introduce sizable performance and memory gains, as well as improved capabilities, across many different use-cases in Lucene: not just the inverted index, but also things like auto-suggest, spell correction, and Japanese text analysis. A lot of these areas have been traditional weaknesses of Lucene but are now rapidly improving! Transducers are powerful, fast, and compact solutions to many of these problems, but despite the fact that the ideas have been around for a long time, they really haven't gotten much use in the Java world. I think we have a really solid practical implementation in Apache Lucene, and I'm hoping the talk might also inspire developers from other projects to take a look at (or maybe borrow/use!) what we have done.
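To give a flavor of the API he is describing, here is a rough sketch of building and querying an FST that maps terms to weights, as auto-suggest does; the FST API details vary across Lucene versions, so treat this as an approximation of the 4.x shape with invented terms and weights:

    import org.apache.lucene.util.BytesRef;
    import org.apache.lucene.util.IntsRef;
    import org.apache.lucene.util.fst.Builder;
    import org.apache.lucene.util.fst.FST;
    import org.apache.lucene.util.fst.PositiveIntOutputs;
    import org.apache.lucene.util.fst.Util;

    class FstDemo {
      public static void main(String[] args) throws Exception {
        // Build a transducer mapping terms to numeric outputs (e.g.
        // suggestion weights). Inputs must be added in sorted order.
        PositiveIntOutputs outputs = PositiveIntOutputs.getSingleton();
        Builder<Long> builder = new Builder<>(FST.INPUT_TYPE.BYTE1, outputs);
        String[] terms = {"star", "start", "stop"};
        long[] weights = {10, 7, 42};
        IntsRef scratch = new IntsRef();
        for (int i = 0; i < terms.length; i++) {
          builder.add(Util.toIntsRef(new BytesRef(terms[i]), scratch), weights[i]);
        }
        FST<Long> fst = builder.finish();  // the whole automaton: one compact byte array

        Long weight = Util.get(fst, new BytesRef("stop"));  // exact lookup -> 42
        System.out.println(weight);
      }
    }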
When did you start contributing to Apache projects?
I started contributing back to Lucene over three years ago, after working and customizing Lucene for a long time to search text in many different languages.
Have you enjoyed previous Berlin Buzzwords editions?
Yes, I didn't make it last year, but the first one was great!
From everything that people are right now working on in Apache Lucene - in your opinion what will be the most interesting new developments and changes in the next major release?
As far as what everyone is working on: there is a lot going on! I think interested folks can get a good glimpse of this by looking at the huge variety of Lucene-related talks in the program. The next major release (4.0) introduces the ability to select your underlying index format and data structures, or even customize your own. We also introduce support for pluggable relevance ranking models beyond just the vector space model.
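A small illustration of the pluggable ranking he mentions - our sketch, approximating the 4.0-era API: the similarity can be swapped at both index and query time.

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.similarities.BM25Similarity;
    import org.apache.lucene.util.Version;

    class RankingConfig {
      // Index-time: score with BM25 instead of the default vector space model.
      static IndexWriterConfig writerConfig() {
        IndexWriterConfig config = new IndexWriterConfig(
            Version.LUCENE_40, new StandardAnalyzer(Version.LUCENE_40));
        config.setSimilarity(new BM25Similarity());
        return config;
      }

      // Query-time: the searcher must use the same similarity.
      static void configure(IndexSearcher searcher) {
        searcher.setSimilarity(new BM25Similarity());
      }
    }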
What are you currently working on in Apache Lucene? Which areas would you like to improve, speed up or make easier to use?
Currently I am only working on the boring stuff: documentation, packaging, tests, etc. The next release (4.0) introduces major changes and there is a lot to get in shape here so that we can get it out. Occasionally I take the time to work on a bug here or there too, just so I don't forget how to code. After this stuff is in good shape I look forward to continuing some work on things like autosuggest.
Anything you are planning to hack on during Berlin Buzzwords? Where should attendees interested in Lucene try to meet you and ask questions?
I imagine I'll be hanging around the other Lucene hackers like Simon Willnauer, Grant Ingersoll, Michael Busch, Andrzej Bialecki, Otis Gospodnetić, Uwe Schindler, Christian Moen, and Martijn van Groningen. We will probably find something to tackle over the course of the conference!
Rafał Kuć
Could you briefly introduce yourself?
My name is Rafał Kuć. I'm a software engineer and consultant living in Bialystok, Poland, near the eastern border. I work for Sematext as a software engineer focused on search, information retrieval and big data. I'm also the author of "Solr 3.1 Cookbook" and a co-founder of solr.pl.
How did you get started developing software?
I started a long time ago, during my college days, when I developed some simple games. That's how it all began. I came across Lucene while I was studying and working as a software developer on content management and logistics systems. That was when I decided to go for search and information retrieval. Since then I've tried to concentrate on those topics, and now I finally can :)
What will your talk be about, exactly?
My talk will be all about ElasticSearch. I'll start with the problem of shard allocation, preventing nodes from running hot, and preventing uneven shard distribution. I'll talk about how to use routing and how to use nodes that are only there to route the data to the rest of the cluster. Attendees will also be able to hear about handling multilingual content (at both query and indexing time) and about improving query performance of ElasticSearch. I'll also try to show how we overcame some problems we hit when dealing with large amounts of data in the cluster, like problems with cache size and ones related to the JVM and garbage collection. The final part of the talk will concentrate on sharing our knowledge about performance testing of ElasticSearch clusters and monitoring them.
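To illustrate the routing idea Rafał mentions, here is a minimal sketch against the ElasticSearch Java client API of that era; the index, type and routing-key choices are invented for the example:

    import org.elasticsearch.action.search.SearchResponse;
    import org.elasticsearch.client.Client;
    import org.elasticsearch.index.query.QueryBuilders;

    class RoutedEvents {
      // Route all of one user's documents to the same shard: the shard is
      // chosen by hashing the routing value instead of the document id.
      static void indexEvent(Client client, String userId, String json) {
        client.prepareIndex("events", "event")
            .setRouting(userId)
            .setSource(json)
            .execute().actionGet();
      }

      // Queries that pass the same routing value hit a single shard instead
      // of fanning out across the whole cluster.
      static SearchResponse searchUser(Client client, String userId) {
        return client.prepareSearch("events")
            .setRouting(userId)
            .setQuery(QueryBuilders.matchAllQuery())
            .execute().actionGet();
      }
    }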
Have you enjoyed previous Berlin Buzzwords editions?
I'm afraid it'll be my first time at Buzzwords this year, but hopefully not the last :)
Is there anything else you want to tell us?
I've never been to Berlin Buzzwords, but from seeing the topics of talks I see that it'll be amazing. Keep it up like this :)
Eric Evans
Could you briefly introduce yourself?
My name is Eric Evans, I am a long-time Free Software hacker living in San Antonio, Texas. I work on distributed systems, including Apache Cassandra, for a London-based start-up called Acunu.
How did you get started developing software?
I used to work with control systems used in industrial automation. Over time, these systems made increasing use of single-purpose controllers programmed with proprietary tools, and eventually, commodity PCs. My interests followed this trajectory until realizing all of the other interesting things computers could be used for.
What will your talk be about, exactly?
My talk will focus on Castle, and how we use it with Cassandra. Castle is an open source replacement for the Linux storage stack. It manages a set of disks to provide aggregation for performance and redundancy (think RAID), implements full versioning, including support for snapshots and clones, and is write-optimized for big data workloads. Like many next-gen databases, Cassandra itself is write-optimized, but its LSM-tree is an almost pathological worst-case for the JVM's garbage collector. Among other things, replacing Cassandra's storage with a Castle-based backend moves problematic memory management into the Linux kernel, virtually eliminating GC-related performance issues in the process.
Legend has it that the term NoSQL was created in a brief chat about a meetup at the Hadoop summit by Johan Oskarsson and yourself. Did you expect the impact it had afterwards? Did you ever wish you had defined it differently?
Johan had planned an entire meetup before choosing a name for it; that was the final step. As luck would have it, I just happened to be in the IRC channel where he asked for ideas, and "NoSQL" stuck for some reason. It's still surreal to me that something so important to so many people could get its name because of something I (stupidly) blurted out on IRC. Many have said that the controversial name has drawn valuable attention to important projects, and that's probably true, but it's also hurt to have them categorized by something they do not share in common. Dogs, hammers, and corn, for example, are all things which cannot fly, but it would be absurd to categorize them as NoFlight. NoSQL is not much better, I'm afraid.
Many of the technologies behind today's buzzword-heavy talks have come from the Apache Software Foundation. What do you think makes Apache projects so successful, in particular for communities developing complex software? What are the risks of going Apache?
I think what makes the ASF unique is how comprehensive it is. It's licensing and legal counsel, project hosting infrastructure, and mentoring, all rolled into one. For a new project with code, but no idea what to do with it, this is a powerful combination. I sometimes worry, though, about what seems to be a single-mindedness with regard to the Apache Way. I think this can be off-putting to newcomers, and unnecessarily disruptive to an otherwise working project.
What is the major area of focus today for Apache Cassandra?
For the first couple of years it was all about features and performance. That work will never be done, but the project has achieved most of its goals, and the focus is increasingly shifting toward usability.
Is there anything else you want to tell us?
Just that I think Buzzwords is a fantastic conference. Anyone interested in storage, search, and scaling should definitely find a way to attend. Keep up the great work!